
2. RL Theory and Algorithms

This chapter covers how Reinforcement Learning works and explains the concepts behind it, including the different algorithms that form the basis of Reinforcement Learning.
The chapter explains these algorithms, but to start with, you will learn why Reinforcement Learning can be hard and see some different scenarios. The chapter also covers different ways that Reinforcement Learning can be implemented.
Along the way, the chapter formulates the Markov Decision Process (MDP) and describes it. The chapter also covers SARSA and touches on temporal differences. Then, the chapter touches on Q Learning and dynamic programming.

Theoretical Basis of Reinforcement Learning

This section touches on the theoretical basis of Reinforcement Learning. Figure 2-1 shows how you are going to implement MDP, which is described later.
Figure 2-1. Theoretical basis of MDP
Environments in Reinforcement Learning are represented by the Markov Decision Process (discussed later in this chapter).
  • S is a finite set of states. A is a finite set of actions.
  • T : S × A × S → [0, 1] is a transition model that maps (state, action, state) triples to probabilities.
  • T(s, a, s′) is the probability that you’ll land in state s′ if you were in state s and took action a.
In terms of conditional probabilities, the following is true:
T(s, a, s′) = P(s′ | s, a)
R : S × S → ℝ is a reward function that gives a real number that represents the amount of reward (or punishment) the environment will grant for a state transition. R(s, s′) is the reward received after transitioning to state s′ from state s.
If the transition model is known to the agent, i.e., the agent knows where it would probably go from where it stands, it’s fairly easy for the agent to know how to act in a way that maximizes its expected utility from its experience with the environment.
We can define the expected utility for the agent to be the accumulated rewards it gets throughout its experience with the environment. If the agent goes through the states s0, s1, …, sn−1, sn, you can formally define its expected utility as follows:
∑_{t=1}^{n} γ^t E[R(s_{t−1}, s_t)]
where γ is a discount factor used to decrease the values (and hence the importance) of rewards that lie further in the future, and E is the expected value.
The problem arises when the agent has no clue about the probabilistic model behind the transitions, and this is where RL comes in. The RL problem can now be formally defined as the problem of learning a set of parameters in order to maximize the expected utility.
RL comes in two flavors:
  • Model-based : The agent attempts to sample and learn the probabilistic model and use it to determine the best actions it can take. In this flavor, the set of parameters that was vaguely referred to is the MDP model.
  • Model-free : The agent doesn’t bother with the MDP model and instead attempts to develop a control function that looks at the state and decides the best action to take. In that case, the parameters to be learned are the ones that define the control function.
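To make the definitions above concrete, here is a minimal sketch of a tiny MDP in Python. The states, actions, probabilities, and rewards are invented for illustration; only the structure (a transition model T mapping (state, action) pairs to next-state probabilities, rewards on transitions, and a γ-discounted sum of rewards) follows the definitions above.
import random

# Hypothetical two-state MDP, purely for illustration.
# T[(s, a)] maps each possible next state s' to T(s, a, s') = P(s' | s, a).
T = {
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s0": 0.1, "s1": 0.9},
    ("s1", "stay"): {"s1": 1.0},
}
R = {("s0", "s1"): 1.0}   # R(s, s'); transitions not listed here give reward 0
GAMMA = 0.9               # discount factor

def step(s, a):
    # Sample s' according to T(s, a, .) and return (s', R(s, s')).
    nxt = T[(s, a)]
    s_ = random.choices(list(nxt.keys()), weights=list(nxt.values()))[0]
    return s_, R.get((s, s_), 0.0)

def discounted_return(states):
    # Sum of gamma^t * R(s_{t-1}, s_t) along one sampled trajectory
    # (a single sample of the expectation in the formula above).
    return sum(GAMMA ** t * R.get((states[t - 1], states[t]), 0.0)
               for t in range(1, len(states)))

s = "s0"
trajectory = [s]
for _ in range(5):
    s, _r = step(s, "go")
    trajectory.append(s)
print(trajectory, discounted_return(trajectory))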

Where Reinforcement Learning Is Used

This section discusses the different fields of Reinforcement Learning, as shown in Figure 2-2.
Figure 2-2. Different fields of Reinforcement Learning

Manufacturing

In manufacturing, factory robots use Reinforcement Learning to pick an object from one box and place it in another container.
Whether it fails or succeeds at the delivery, the robot learns from the outcome, with the end goal of performing the task with the greatest accuracy.

Inventory Management

In terms of inventory management, Reinforcement Learning can be used to reduce the transit time involved in stocking and retrieving products in warehouses and to utilize warehouse space optimally.

Delivery Management

Reinforcement Learning is applied to solve the problem of split delivery vehicle routing. Q Learning is used to decide which customers a single vehicle should serve.

Finance Sector

Reinforcement Learning is being used in the finance sector, for example in accounting and in developing and evaluating trading strategies.

Why Is Reinforcement Learning Difficult?

One of the toughest parts of Reinforcement Learning is having to map the environment and include all possible moves. For example, consider a board game.
You have to apply artificial intelligence to what is learned. In theory, Reinforcement Learning should be a perfect fit, but a board game involves a huge number of state jumps and complex moves, so applying Reinforcement Learning by itself becomes difficult.
To get the best results, we apply a rule-based engine with Reinforcement Learning. If we don’t apply a rule-based engine, there are so many options in board games that the agent will take forever to discover the path.
First of all, we apply simple rules so that the AI learns quickly and then, as the complexity increases, we apply Reinforcement Learning.
Figure 2-3 shows how applying Reinforcement Learning can be difficult.
Figure 2-3. Reinforcement Learning with rules

Preparing the Machine

Before you can run the examples, you need to perform certain steps to install the software. The examples in this book use the Anaconda version of Python, so this section explains how to find and download it. First, you need to open a terminal. The process of starting the terminal is shown in Figure 2-4.
Figure 2-4. Opening the terminal
Next, you need to update the packages. Write the following command in the terminal to do so. See Figure 2-5.
sudo apt-get update
Figure 2-5. Updating the packages
After you run the update command, the package lists are refreshed, as shown in Figure 2-6.
Figure 2-6. Everything has been updated
Now you can use another command for installing the required packages. Figure 2-7 shows the process.
sudo apt-get install golang python3-dev python-dev libcupti-dev libjpeg-turbo8-dev make tmux htop chromium-browser git cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig
Figure 2-7. Fetching the updates
As shown in Figure 2-8, you’ll need to type y and then press Enter to continue.
Figure 2-8. Continue with the installation
In the next step, the essential packages are downloaded and updated accordingly, as shown in Figure 2-9.
Figure 2-9. Downloading and extracting the packages
Next, you need to download the Anaconda distribution of Python. Open a browser window in Ubuntu (this example uses Mozilla Firefox) and search for the Anaconda installer, as shown in Figure 2-10.
Figure 2-10. Downloading Anaconda
Now you have to find the download that’s appropriate for your particular operating system. The Anaconda page is shown in Figure 2-11.
Figure 2-11. Anaconda page
Select the appropriate distribution of Anaconda, as shown in Figure 2-12.
Figure 2-12. Selecting the Anaconda version
Save the file next, as shown in Figure 2-13.
Figure 2-13. Saving the file
Now, using the terminal, change into the Downloads folder and check that the file was saved. See Figure 2-14.
Figure 2-14. Getting inside the downloads folder
You now have to use the bash command to run the shell script (see Figure 2-15):
bash Anaconda3-4.4.0-Linux-x86_64.sh
Figure 2-15. Running the shell script
When the installer prompts you, type yes and press Enter. Anaconda will be installed into the home location, as shown in Figure 2-16.
Figure 2-16. Setting up the Anaconda environment
The next step, shown in Figure 2-17, will install all the important packages for Anaconda so that it is configured properly.
Figure 2-17. Installing the key packages for Anaconda
After the Anaconda installation is complete, you need to open a new terminal to set up your Anaconda environment. You have to create a new environment for Anaconda using the conda create command (see Figure 2-18).
This command keeps all the packages in an isolated place.
conda create --name universe python=3.6 anaconda
Figure 2-18. Creating an environment
In the next step, the Anaconda environment will install the necessary packages. See Figure 2-19.
Figure 2-19. The packages for installing or updating Anaconda
Type y and then press Enter to continue. Then the entire process will be complete after every package is updated in the environment. You can now activate the environment. See Figure 2-20.
Figure 2-20. The packages for installing or updating Anaconda
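The exact activation command depends on your conda version; for the environment named universe created above, it is typically one of the following:
source activate universe
conda activate universe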
Some additional updates might need to be installed. You also need to install Swig, as shown in Figure 2-21.
conda install pip six libgcc swig
Figure 2-21. Installing Swig too
You will also have to install OpenCV in order to update certain packages, as shown in Figure 2-22.
Figure 2-22. Installing OpenCV
If there are updates to OpenCV, type y to install them too. See Figure 2-23.
Figure 2-23. Installing OpenCV
Next, you need to install TensorFlow. This chapter shows how to install the CPU version. See Figure 2-24.
pip install --upgrade tensorflow
Figure 2-24. Installing TensorFlow
Figure 2-25 shows the packages being installed for TensorFlow.
Figure 2-25. TensorFlow installs the packages
The next step, shown in Figure 2-26, asks for the privileges to install the other packages. Type y to continue.
Figure 2-26. Package installation happens
In the next section, we install Docker. We will first learn what Docker is.

Installing Docker

Docker packages applications into containers, which makes it a good option when you want to run your containers in the cloud. Developers generally use Docker because the entire architecture can be hosted on a single developer machine, which minimizes setup work. Enterprises use Docker to maintain an agile environment, and operators generally use Docker to run, manage, and keep an eye on apps effectively.
Now you will install Docker, as it is essential for OpenAI Gym and Universe to work: Universe runs its training environments inside Docker containers, which keeps the simulations responsive while using relatively few resources.
The command to be entered in the terminal is shown here:
$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
The next command to enter is:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
This uses curl to download Docker's GPG key over HTTPS and adds it to your system's trusted keys.
Now add the Docker repository using this command:
$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
Type this command to update the package lists so the new repository is picked up, as shown in Figure 2-27:
$ sudo apt-get update
Figure 2-27. Updating the package
Type this command to install Docker, as shown in Figure 2-28:
$ sudo apt-get install docker-ce
Figure 2-28. Docker installation
To test Docker, use these commands (see Figure 2-29):
$ sudo service docker start
$ sudo docker run hello-world
Figure 2-29. Testing docker

An Example of Reinforcement Learning with Python

This section goes through an example of Reinforcement Learning and explains the flow of the algorithm. You’ll see how Reinforcement Learning can be applied. This section uses an open source GitHub repo that has a very good example of Reinforcement Learning. You will need to clone it to work with it.
The GitHub repo link is https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow . Within the Ubuntu module, get inside the terminal and start cloning the repo, as shown in Figure 2-30.
Figure 2-30. Cloning the repo
Figure 2-31 shows how the repo is replicated.
Figure 2-31. Replication of the repo
You will next get inside the folder that you used, as shown in Figure 2-32.
Figure 2-32. Getting inside the folder
We are working with a Reinforcement Learning scenario in which the letter O represents a wanderer. The wanderer wants to reach the treasure T as fast as it can.
The condition looks like this:
O-----T
The wanderer tries to find the quickest path to reach the treasure. During each episode, the steps the wanderer takes to reach the treasure are counted. With each episode, the condition improves and the number of steps declines.
Here are some of the basic steps in terms of Reinforcement Learning:
  • The program first defines the available actions, as actions are central to Reinforcement Learning.
  • The available actions for this wanderer are moving left or right:
    ACTIONS = ['left','right']
  • The wanderer can be considered the agent.
  • The number of states (the length of the one-dimensional world) is limited to 6 in this example:
    N_STATES = 6
Now you need to apply hyperparameters for Reinforcement Learning.

What Are Hyperparameters?

Hyperparameters are variables that are set before the model's parameters are learned. Generally, they are distinct from the parameters of the model for the underlying system under analysis.
We introduce epsilon, alpha, and gamma.
  • Epsilon is the greedy factor
  • Alpha is the learning rate
  • Gamma is the discount factor
The maximum number of episodes in this case is 13. The refresh time controls how often the displayed scenario is refreshed.
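These settings correspond to the constants defined near the top of the full code listing later in this section:
EPSILON = 0.9       # greedy policy factor
ALPHA = 0.1         # learning rate
GAMMA = 0.9         # discount factor
MAX_EPISODES = 13   # maximum number of episodes
FRESH_TIME = 0.3    # refresh time for one move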

Writing the Code

To create the process from which the computer learns, we have to formulate a table. This process is known as Q Learning, and the table is called a Q table. (You will learn more about Q Learning in the next chapter.) All the key elements are stored in the Q table, and all the decisions are made based on the Q table.
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table
Now we have to take actions. To do so, we use this code:
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or (state_actions.all() == 0):  # act non-greedy or state-action have no value
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.argmax()
    return action_name
Now we create the environment and determine how the agents will work within the environment:
def get_env_feedback(S, A):
    # This is how the agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
This function prints the wanderer and treasure hunt conditions:
def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print(' {}'.format(interaction), end='')
        time.sleep(2)
        print('                                 ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print(' {}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
The rl() method runs the Q Learning scenario, which we discuss in the next chapter:
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.ix[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode
            q_table.ix[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state
            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table
if __name__ == "__main__":
    q_table = rl()
    print(' Q-table: ')
    print(q_table)
The full code looks like this:
import numpy as np
import pandas as pd
import time
np.random.seed(2)  # reproducible
N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or (state_actions.all() == 0):  # act non-greedy or state-action have no value
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.argmax()
    return action_name
def get_env_feedback(S, A):
    # This is how the agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print(' {}'.format(interaction), end='')
        time.sleep(2)
        print('                                 ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print(' {}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.ix[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode
            q_table.ix[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state
            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table
if __name__ == "__main__":
    q_table = rl()
    print(' Q-table: ')
    print(q_table)
Let’s now run the program and analyze the output. You need to get inside the cloned GitHub repo and into the required folder, as shown in Figure 2-33.
Figure 2-33. Getting inside the cloned repo
Now you need to get inside the directory to run the program, as shown in Figure 2-34.
Figure 2-34. Checking the directory
Now you have to run the program called treasure_on_right.py, which places the treasure to the right of the agent. See Figure 2-35.
Figure 2-35. Running the Python file
The program is running iterations, as shown in Figure 2-36.
Figure 2-36. As the iteration happens
As the program and the simulation complete, the final result is a Q table: for each state, the values reflect what the agent has learned about the worth of moving left versus moving right. Figure 2-37 shows the completed Q table.
Figure 2-37. The Q table created as a result

What Is MDP?

MDP (Markov Decision Process) is a framework of mathematical formulas and models for decision making in situations where outcomes are partly random and partly under the control of a decision maker.
MDPs have many different applications , as shown in Figure 2-38.
Figure 2-38. MDP and its applications
Every state in MDP satisfies the Markov property.

The Markov Property

In the world of Reinforcement Learning, the Markov property refers to the memory-less property of a stochastic process. Stochastic means the process is described by random variables: we are not storing a fixed value of a variable, because it changes on each iteration. The memory-less part means that what happens next depends only on the current state. See Figure 2-39.
Figure 2-39. The Markov property process
We talk about the Markov Chain in the next section.

The Markov Chain

A Markov process with a discrete state space or a discrete index set is known as a Markov Chain. The Markov Chain works in two ways, as shown in Figure 2-40.
Figure 2-40. Markov Chain
Let’s look at Markov Chains using an example. This example compares sales of Rin detergent versus the other detergents in the market. Assume that sales of Rin make up 20 percent of total detergent sales, which means the rest make up 80 percent. People who use Rin detergent are defined as A; the others are A′.
Now we define a rule. Of the people who use Rin detergent, 90% of them continue to use it after a week whereas 10% shift to another brand.
Similarly, 70% of the people who use another detergent shift to Rin after a week, and the rest continue to use the other detergent.
To analyze these conditions, we need a state diagram. See Figure 2-41.
Figure 2-41. Rin detergent state diagram
In the state diagram, we have created a scenario where the circular points represent states. From this state diagram, we have to assign a transition probability matrix.
The transition probability matrix we get from the state diagram is shown in Figure 2-42.
Figure 2-42. The transition probability matrix
To determine the use of Rin after one week (and then after two weeks), we have to apply a principle. This principle is common to each and every Markov process you try.
It can be shown as a connected graph, as shown in Figure 2-43.
Figure 2-43. A connected graph
From the origin, we have two paths: one for Rin detergent (through A) and the other for the rest (that is, A′). Here is how the path is created.
  1. From the origin, we create a path for A, so we have to focus on the transition probability matrix.
  2. We trace the path of A.
  3. From the starting market share, the detergent Rin has a market value of 20%.
  4. From the starting point A, we focus on the transition probability matrix.
There is a 90% probability of staying on A, so the other 10% change to the alternate path (to A′).
Figure 2-44 shows this path calculation graphically.
Figure 2-44. Path calculation
The total path probability is determined as follows: P = .2 * .9 + .8 * .7 = .18 + .56 = .74.
This is the percentage of people using Rin after one week.
This formula can also be conceptualized in terms of the starting market share (S0) and the transition probability matrix (P):
S0 * P = market share after one week
See Figure 2-45.
Figure 2-45. The matrix created for the next week
The calculation is:
.2 * .9 + .8 * .7 = .74
.2 * .1 + .8 * .3 = .26
S1 = [ .74  .26 ]
This is the first state matrix: after one week, Rin detergent has 74% of the market, and the other brands make up the remaining 26%.
Now try to find the percentage of people using Rin detergent after two weeks. Figure 2-46 shows the calculation that we need to do after two weeks.
Figure 2-46. The next transition matrix
So the result is:
   A      A′
= [ .848  .152 ]
After two weeks, 84.8% of the people will use Rin and 15.2% will use other detergents.
One question you might have is whether the sale of Rin will ever grow to 100% of the market. It will not: as we keep iterating, the state matrix becomes stationary after a certain number of iterations and finally settles at
   A      A′
= [ .875  .125 ]
which is the fixed point of this transition matrix, given the retention and switching rates above.
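A quick way to check these numbers is to iterate the transition matrix with NumPy. This is a small verification sketch using the shares and probabilities given above (it is not part of the book's repo):
import numpy as np

P = np.array([[0.9, 0.1],    # from A:  stay with Rin, switch away
              [0.7, 0.3]])   # from A': switch to Rin, stay away
S = np.array([0.2, 0.8])     # initial market share [A, A']

for week in range(1, 4):
    S = S @ P
    print(f"week {week}: {S.round(3)}")   # [.74 .26], [.848 .152], ...

# Iterating long enough converges to the stationary distribution.
for _ in range(100):
    S = S @ P
print("stationary:", S.round(3))          # approximately [.875 .125]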
After going through the basics of the Markov state and the Markov Chain, it’s time to focus on MDPs again.

MDPs

Almost all Reinforcement Learning problems can be formalized as MDPs. MDPs describe the kind of environment in which Reinforcement Learning is applied; at their core, they extend the Markov process.
A state (St) is Markov if and only if it meets the criteria shown in Figure 2-47.
Figure 2-47. The Markov state property
The state captures all relevant information from the history. We do not have to retain everything in history because only the previous state determines what will happen now.
For a Markov state (s) and successor state (s’), the state transition probability is defined in Figure 2-48.
Figure 2-48. The transition probability
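In standard notation (this is the conventional form, not a transcription of the figures), the Markov property and the state transition probability are written as follows:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, …, S_t]
P_{ss′} = P[S_{t+1} = s′ | S_t = s]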
An MDP is a Markov reward process with a decision component added. It is a type of environment where all the states are Markov.
An MDP is a five-tuple <S, A, P, R, γ>:
  • S is the finite set of states
  • A is the finite set of actions
  • P is the state transition probability matrix
  • R is the reward function
  • γ (gamma) is the discount factor
A policy (π) is a distribution over actions in a given state. A policy is a function, or a decision-making process, that governs the transitions from one state to another.
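In the same standard notation, a policy can be written as:
π(a | s) = P[A_t = a | S_t = s]
that is, the probability of taking action a when in state s.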

SARSA

SARSA stands for State Action Reward next State and next Action. It is a different kind of Reinforcement Learning approach and is generally derived from temporal difference learning. We’ll discuss temporal difference learning first.

Temporal Difference Learning

Temporal difference learning bases its updates on nearby, successive states. We generally apply it when we are in a state and want to know what happens in the states that follow.
The general idea is that we want to predict the best path over a period of time.
We go from state S0 to state SF, receiving a reward in each state, and we try to predict the discounted sum of rewards. See Figure 2-49.
Figure 2-49. State transition
We start by looking at the Markov Chain, as shown in Figure 2-50.
Figure 2-50. The Markov Chain
Figure 2-51. The value function
The equation states that the value function maps the state to some number. This number is set to 0 if it is in the final state (see Figure 2-51).
For any state, the value is the expected value of the reward (r) and the discounted value of the ending state.
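As a minimal illustration of this idea (a sketch, not code from the book's repo), a tabular TD(0) update of the value function can be written in Python like this:
def td0_update(V, s, r, s_, terminal, alpha=0.1, gamma=0.9):
    # One temporal-difference update:
    #   V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
    v_s = V.get(s, 0.0)
    v_next = 0.0 if terminal else V.get(s_, 0.0)   # final states have value 0
    V[s] = v_s + alpha * (r + gamma * v_next - v_s)
    return V
Each observed transition (s, r, s′) nudges the estimate of V(s) toward the reward plus the discounted value of the state that follows, which is exactly the prediction described above.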

How SARSA Works

Now we get into SARSA. SARSA is known as an on-policy Reinforcement Learning method. On-policy means the agent learns only from its own experience, that is, from the actions it actually takes.
It accumulates updates over one or more steps and learns from those experiences.
From the current state, we choose an action and move to the next state. At the next state, we choose another action, and we use the current state and current action together with the next state and next action. We then update the corresponding Q value.
Here is the algorithm:
  1. Initialize Q(s, a) arbitrarily.
  2. For each episode, initialize s and choose a from s using the policy derived from Q (for example, ε-greedy).
  3. Take action a and observe r and s′.
  4. Choose a′ from s′ using the policy derived from Q (for example, ε-greedy).
  5. Update:
     Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
     s ← s′; a ← a′
  6. Repeat steps 3 through 5 until s is terminal, then start the next episode.
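A compact, self-contained sketch of this loop (independent of the repo code shown later; the environment functions env_reset and env_step are assumed to exist) might look like this in Python:
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.9

def epsilon_greedy(Q, s, actions):
    # Act greedily with probability EPSILON, otherwise explore randomly
    # (the same convention as the EPSILON used elsewhere in this chapter).
    if random.random() < EPSILON:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

def sarsa_episode(env_reset, env_step, actions, Q=None):
    # Run one SARSA episode; env_step(s, a) is assumed to return (s_, r, done).
    if Q is None:
        Q = defaultdict(float)
    s = env_reset()
    a = epsilon_greedy(Q, s, actions)
    done = False
    while not done:
        s_, r, done = env_step(s, a)
        a_ = epsilon_greedy(Q, s_, actions)
        target = r if done else r + GAMMA * Q[(s_, a_)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # SARSA update
        s, a = s_, a_
    return Q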
     

Q Learning

Q Learning is a model-free Reinforcement Learning technique. Figure 2-52 illustrates the general procedure for Q Learning.
Figure 2-52. The Q Learning process

What Is Q?

Q can be stated as a function of two parameters, s and a; it can also be represented as a table indexed by state and action.
Q represents the value of taking action a in state s.
Q[s, a] = Immediate reward + discounted reward
The immediate reward is the reward the agent receives when it performs an action and moves from one state to another.
The discounted reward is the reward expected from future states, reduced by the discount factor.

How to Use Q

In practice, we come across scenarios where we need to use the Q table (or the Q values), and this is where Q is applied.
We want to know which action to take, or which policy to implement, when we are in state s, and we use the Q table to get the best result.
If we are in state s, we need to determine which action is the best. We do not change s, but we go through all the values of a and determine which one is the largest. That will be the action we should take. Mathematically, this idea is represented as shown in Figures 2-53 and 2-54.
Figure 2-53. The policy equation
Figure 2-54. How policy works
For MDP, the policy we should implement depends on the current state. We maximize the rewards to get the optimal solution.
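As a small sketch of this argmax idea (the Q table below is a made-up NumPy array, not the one produced by the chapter's program):
import numpy as np

# Rows are states, columns are actions ['left', 'right'] -- illustrative values only.
Q = np.array([[0.00, 0.004],
              [0.00, 0.025],
              [0.03, 0.110]])

ACTIONS = ['left', 'right']

def greedy_action(Q, s):
    # Keep s fixed and pick the action whose Q value is largest.
    return ACTIONS[int(np.argmax(Q[s]))]

print(greedy_action(Q, 1))   # 'right'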

SARSA Implementation in Python

Recall that SARSA is an on-policy Reinforcement Learning approach.
For example, SARSA can be used to solve a maze. Using the SARSA approach, we cannot compare two different maze environments; the agent learns only from the maze it is actually working in, so we stick to the one maze used in this example.
The best thing about SARSA is that it learns by comparing the current state and action with the next state and action. We accumulate all of these experiences and learn from them.
Let's break this idea down more. The update is done on a Q table by comparing the changes in subsequent steps and then making a decision. This idea is illustrated in Figure 2-55.
Figure 2-55. Updating results using the SARSA table
The learning method in Python is different for SARSA. It looks like this:
def learn(self, s, a, r, s_, a_):
This method depends on the state, the action, the reward, the next state, and the next action.
If we compare the algorithm and convert it to Python, the construct for this equation is shown in Figure 2-56.
Figure 2-56. The SARSA equation
It’s converted to the following:
q_target = r + self.gamma * self.q_table.ix[s_, a_]
The difference between this equation and Q Learning is the change in this equation:
q_target = r + self.gamma * self.q_table.ix[s_, :].max()
The max() value is present in Q Learning but not in SARSA.
The logic for implementing a policy using SARSA is shown here:
# on-policy
class SarsaTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update
The learning process is somewhat different than with Q Learning. The logic works according to the principle discussed previously.
We combine the state and action of the current status with the next state and next action. This in turn updates the Q table. This is the way the learning works.
def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        # RL choose action based on observation
        action = RL.choose_action(str(observation))
        while True:
            # fresh env
            env.render()
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))
            # RL learn from this transition (s, a, r, s, a) ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)
            # swap observation and action
            observation = observation_
            action = action_
            # break while loop when end of this episode
            if done:
                break
Here is the code for creating the maze:
import numpy as np
import time
import sys
if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk
UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width
class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_H * UNIT, MAZE_H * UNIT))
        self._build_maze()
    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                           height=MAZE_H * UNIT,
                           width=MAZE_W * UNIT)
        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_H * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)
        # create origin
        origin = np.array([20, 20])
        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')
        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')
        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')
        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # pack all
        self.canvas.pack()
    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # return observation
        return self.canvas.coords(self.rect)
    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:   # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:   # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:   # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:   # left
            if s[0] > UNIT:
                base_action[0] -= UNIT
        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent
        s_ = self.canvas.coords(self.rect)  # next state
        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False
        return s_, reward, done
    def render(self):
        time.sleep(0.1)
        self.update()

The Entire Reinforcement Logic in Python

When you are implementing the algorithm in Python, the structure looks like the following. The content is in the repo.
import numpy as np
import pandas as pd
class RL(object):
    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = action_space  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions)
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to q table
            self.q_table = self.q_table.append(
                pd.Series(
                    [0]*len(self.actions),
                    index=self.q_table.columns,
                    name=state,
                )
            )
    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.rand() < self.epsilon:
            # choose best action
            state_action = self.q_table.ix[observation, :]
            state_action = state_action.reindex(np.random.permutation(state_action.index))     # some actions have the same value
            action = state_action.argmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action
    def learn(self, *args):
        pass
# off-policy
class QLearningTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(QLearningTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update
# on-policy
class SarsaTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update
The learning process in its entirety looks like this in the code (run_this.py):
from maze_env import Maze
from RL_brain import SarsaTable
def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        # RL choose action based on observation
        action = RL.choose_action(str(observation))
        while True:
            # fresh env
            env.render()
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))
            # RL learn from this transition (s, a, r, s, a) ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)
            # swap observation and action
            observation = observation_
            action = action_
            # break while loop when end of this episode
            if done:
                break
    # end of game
    print('game over')
    env.destroy()
if __name__ == "__main__":
    env = Maze()
    RL = SarsaTable(actions=list(range(env.n_actions)))
    env.after(100, update)
    env.mainloop()
Let’s run the program and check it.
You can do this in the Anaconda environment, as shown in Figure 2-57.
Figure 2-57. Activating the environment
You then have to consider the SARSA maze, as shown in Figure 2-58.
Figure 2-58. Considering the SARSA maze
Now you have to call the run_this.py file to get the program running, as shown in Figure 2-59.
Figure 2-59. Running run_this.py
To run the program from the terminal, use this command:
python run_this.py
After running the code, the program will play the maze, as shown in Figure 2-60.
Figure 2-60. The program playing maze

Dynamic Programming in Reinforcement Learning

Problems that are sequential or temporal can be solved using dynamic programming. If you have a complex problem, you break it down into subproblems, solve those subproblems, and finally combine the solutions to solve the overall problem. Dynamic programming applies when the problem has optimal substructure (the principle of optimality holds) and when subproblem solutions can be cached and reused. See Figure 2-61.
Figure 2-61. Dynamic problem-solving approach
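To connect dynamic programming back to this chapter's examples, here is a small value-iteration sketch for the six-state corridor used earlier. The reward and transition structure mirror get_env_feedback, but the code itself is illustrative rather than taken from the repo:
N_STATES = 6     # corridor o-----T, treasure at the right end
GAMMA = 0.9
ACTIONS = ['left', 'right']

def next_state_and_reward(s, a):
    # Deterministic model of the corridor, mirroring get_env_feedback.
    if a == 'right':
        return ('terminal', 1) if s == N_STATES - 2 else (s + 1, 0)
    return (max(s - 1, 0), 0)

# Value iteration: repeatedly back up V(s) = max over a of [r + gamma * V(s')].
V = {s: 0.0 for s in range(N_STATES - 1)}
V['terminal'] = 0.0
for _ in range(50):                      # 50 sweeps is plenty for 5 states
    for s in range(N_STATES - 1):
        backups = []
        for a in ACTIONS:
            s_, r = next_state_and_reward(s, a)
            backups.append(r + GAMMA * V[s_])
        V[s] = max(backups)
print(V)   # values increase toward the treasure state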

Conclusion

This chapter went through different algorithms related to Reinforcement Learning. You also saw a simple example of Reinforcement Learning using Python. You then learned about SARSA with the help of an example in Python. The chapter ended by discussing dynamic programming basics.