
4. Applying Python to Reinforcement Learning

This chapter explores Reinforcement Learning in terms of Python. We start with Q learning in Python and then move on to a more in-depth analysis of Reinforcement Learning. Next we describe Swarm intelligence in Python, with an introduction to what exactly Swarm intelligence is. The chapter also covers the Markov decision process (MDP) toolbox.
Finally, you will implement a game AI and apply Reinforcement Learning to it. The chapter will be a good experience, so let's begin!

Q Learning with Python

Let’s start with a maze problem. The object of the game is to reach the yellow circle while avoiding the black squares. Figure 4-1 shows the maze. We use the numpy library in this example.
A454310_1_En_4_Fig1_HTML.jpg
Figure 4-1.
The maze that demonstrates Q learning
We have to choose an action based on the Q table, which is why we have a function called choose_action. When we want to move from one state to another, we apply the decision-making process in the choose_action method:
def choose_action(self, observation):
The learn function takes a transition consisting of the current state, the action taken, and the reward received, together with the next state:
def learn(self, s, a, r, s_):
The check_state_exist function checks whether a state is already in the Q table and appends it if it is not:
def check_state_exist(self, state):
These functions make up RL_brain, which is the basis of the project. The Q-learning update rules are applied by the run_this.py file, shown later in this chapter.
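For reference, the update that the learn function applies is the standard Q-learning rule, where \alpha is the learning rate lr and \gamma is the discount factor reward_decay used in the code:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]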

The Maze Environment Python File

The maze environment Python file, shown here, implements the environment for making moves. It declares the rewards as well as the ability to take the next step.
"""
Reinforcement learning maze example.
Red rectangle:          explorer.
Black rectangles:       hells       [reward = -1].
Yellow bin circle:      paradise    [reward = +1].
All other states:       ground      [reward = 0].
This script is the environment part of this example. The RL is in RL_brain.py.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
"""
import numpy as np
import time
import sys
if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk
UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width
class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_H * UNIT, MAZE_H * UNIT))
        self._build_maze()
    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                           height=MAZE_H * UNIT,
                           width=MAZE_W * UNIT)
        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_H * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)
        # create origin
        origin = np.array([20, 20])
        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')
        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')
        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')
        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # pack all
        self.canvas.pack()
    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # return observation
        return self.canvas.coords(self.rect)
    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:   # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:   # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:   # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:   # left
            if s[0] > UNIT:
                base_action[0] -= UNIT
        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent
        s_ = self.canvas.coords(self.rect)  # next state
        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False
        return s_, reward, done
    def render(self):
        time.sleep(0.1)
        self.update()
def update():
    for t in range(10):
        s = env.reset()
        while True:
            env.render()
            a = 1
            s, r, done = env.step(a)
            if done:
                break
if __name__ == '__main__':
    env = Maze()
    env.after(100, update)
    env.mainloop()

The RL_Brain Python File

Now for the RL_brain Python file. We define the Q-learning table that is built up as the agent moves from one state to another. The QLearningTable class structures the way the entire maze is learned. We also declare the hyperparameters that control how fast the program learns in the next chunk of code:
import numpy as np
import pandas as pd
class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions)
    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.uniform() < self.epsilon:
            # choose best action
            state_action = self.q_table.loc[observation, :]
            state_action = state_action.reindex(np.random.permutation(state_action.index))     # shuffle so ties between equal values are broken randomly
            action = state_action.idxmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action
    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append a new row of zeros (one value per action) for this state
            self.q_table.loc[state] = [0.0] * len(self.actions)
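As a quick sanity check (the state labels here are hypothetical and not part of the project files), you can exercise the class on its own:
from RL_brain import QLearningTable

RL = QLearningTable(actions=list(range(4)))
a = RL.choose_action('state_A')       # 'state_A' is added to the table the first time it is seen
RL.learn('state_A', a, 0, 'state_B')  # nudges Q('state_A', a) toward r + gamma * max Q('state_B', .)
print(RL.q_table)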

Updating the Function

This code segment declares the update function that drives the agent through the maze from one state to another. On each transition it receives the reward and lets the agent learn from it.
from maze_env import Maze
from RL_brain import QLearningTable
def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        while True:
            # fresh env
            env.render()
            # RL choose action based on observation
            action = RL.choose_action(str(observation))
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL learn from this transition
            RL.learn(str(observation), action, reward, str(observation_))
            # swap observation
            observation = observation_
            # break while loop when end of this episode
            if done:
                break
    # end of game
    print('game over')
    env.destroy()
if __name__ == "__main__":
    env = Maze()
    RL = QLearningTable(actions=list(range(env.n_actions)))
    env.after(100, update)
    env.mainloop()
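If you want to see what the agent actually learned, one small optional addition (not part of the original script) is to print the Q table at the end of update(), just before env.destroy():
    print(RL.q_table)   # optional: inspect the learned state-action values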
If you go into the project folder and run the run_this.py file, you get the output shown in Figure 4-2.
A454310_1_En_4_Fig2_HTML.jpg
Figure 4-2.
Running the file
Figure 4-3 shows the code running.
A454310_1_En_4_Fig3_HTML.jpg
Figure 4-3.
The maze file being run

Using the MDP Toolbox in Python

The MDP toolbox provides classes and functions for the resolution of discrete time Markov decision processes. The list of algorithms that have been implemented includes backwards induction, linear programming, policy iteration, Q learning, and value iteration along with several variations.
The following are the features of the MDP toolbox (see Figure 4-4):
  • Eight MDP algorithms
  • Fast array manipulation using NumPy
  • Full sparse matrix support using Scipy’s sparse package
  • Optional linear programming support using cvxopt
A454310_1_En_4_Fig4_HTML.jpg
Figure 4-4.
MDP toolbox features
Next, you see how to install and configure the MDP toolbox for Python. First, activate the Anaconda environment, as shown in Figure 4-5.
A454310_1_En_4_Fig5_HTML.jpg
Figure 4-5.
Activating the Anaconda environment
Now install the dependencies using this command (see Figure 4-6):
sudo apt-get install python3-numpy python3-scipy liblapack-dev libatlas-base-dev libgsl0-dev fftw-dev libglpk-dev libdsdp-dev
A454310_1_En_4_Fig6_HTML.jpg
Figure 4-6.
Installing the dependencies
When it asks you if it should install the dependencies, choose yes, as shown in Figure 4-7.
A454310_1_En_4_Fig7_HTML.jpg
Figure 4-7.
Choose yes to proceed
All the dependencies are then installed, as shown in Figure 4-8.
A454310_1_En_4_Fig8_HTML.jpg
Figure 4-8.
The dependencies are installed
Now you can go ahead and install the MDP toolbox, as shown in Figure 4-9.
A454310_1_En_4_Fig9_HTML.jpg
Figure 4-9.
Installing the MDP toolbox
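The figure shows the command being run; if you prefer to install from PyPI instead (the package is published there as pymdptoolbox), the equivalent is:
(universe) abhi@ubuntu:∼$ pip install pymdptoolbox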
The important packages are being installed, as shown in Figure 4-10.
A454310_1_En_4_Fig10_HTML.jpg
Figure 4-10.
Installing the important packages
If everything works as expected, you’ll get all the packages installed, as shown in Figure 4-11.
A454310_1_En_4_Fig11_HTML.jpg
Figure 4-11.
All the packages have been installed
Now you need to clone the repo from GitHub (see Figure 4-12):
git clone https://github.com/sawcordwell/pymdptoolbox.git
A454310_1_En_4_Fig12_HTML.jpg
Figure 4-12.
Cloning the repo
Switch to the mdptoolbox folder to see the details shown in Figure 4-13.
A454310_1_En_4_Fig13_HTML.jpg
Figure 4-13.
Getting inside the folder
You now need to switch to Python mode, as shown in Figure 4-14.
A454310_1_En_4_Fig14_HTML.jpg
Figure 4-14.
Inside Python mode
We will now use an example to see how the MDP toolbox works. First, import the MDP example, as shown in Figure 4-15.
A454310_1_En_4_Fig15_HTML.jpg
Figure 4-15.
Importing the modules
A Markov problem assumes that future states depend only on the current state, not on the events that occurred before it. We will set up an example Markov problem using a discount factor of 0.8. To use the built-in examples in the MDP toolbox, you import mdptoolbox.example, solve the problem with the value iteration algorithm, and then check the optimal policy. The optimal policy is the mapping from states to actions that yields the maximum expected reward.
You can check the policy with the vi.policy command, as shown in Figure 4-16.
A454310_1_En_4_Fig16_HTML.jpg
Figure 4-16.
Doing operations
The output for the policy is (0, 0, 0), meaning the optimal action in each of the three states is action 0. The expected discounted reward of each state under this policy can also be read from the solver, as shown after the listing.
Here is the full program:
import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.8)
vi.run()
vi.policy # result is (0, 0, 0)
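The expected discounted reward of each state under this policy is also available on the solver object:
vi.V   # a tuple with one expected discounted reward per state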
Let's consider another example. First you need to import the toolbox and the toolbox example. By importing mdptoolbox.example, you bring in the built-in example problems that ship with the MDP toolbox (see Figure 4-17).
import mdptoolbox, mdptoolbox.example
A454310_1_En_4_Fig17_HTML.jpg
Figure 4-17.
Another example of MDP
This example can also be run in verbose mode, which displays the current stage and the policy transpose as the finite-horizon solver runs.
>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
>>> fh.run()
>>> fh.V
array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
       [ 5.9373,  3.24  ,  1.    ,  0.    ],
       [ 9.9373,  7.24  ,  4.    ,  0.    ]])
>>> fh.policy
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])
The next example also runs in verbose mode; each iteration displays the number of actions that differ between policy n-1 and policy n (see Figure 4-18).
A454310_1_En_4_Fig18_HTML.jpg
Figure 4-18.
Policy between n-1 and n
We again get help from the built-in examples of the MDP toolbox, this time solving discounted MDPs with policy iteration. Some of the transition and reward values are randomly generated using rand(10, 3), and the others come from the forest example used earlier.
We try to solve each MDP by applying policy iteration in this example:
>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.rand(10, 3)
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> P, R = mdptoolbox.example.forest()
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> expected = (26.244000000000014, 29.484000000000016, 33.484000000000016)
>>> all(expected[k] - pi.V[k] < 1e-12 for k in range(len(expected)))
    True
>>> pi.policy
(0, 0, 0)
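Because this chapter focuses on Q learning, it is worth noting that the toolbox also includes a tabular QLearning solver. A minimal sketch on the same forest example follows (the run is stochastic, so the seed is set only to make it repeatable; your learned policy may still differ):
>>> import numpy as np
>>> import mdptoolbox, mdptoolbox.example
>>> np.random.seed(0)
>>> P, R = mdptoolbox.example.forest()
>>> ql = mdptoolbox.mdp.QLearning(P, R, 0.96)
>>> ql.run()
>>> ql.policy
You can also inspect the learned Q table via ql.Q.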

Understanding Swarm Intelligence

Swarm intelligence is an important part of AI. It is the collective behavior of a decentralized, self-organized system, whether it be natural or artificial.
Swarm intelligence typically consists of a population of simple agents or boids (artificial life programs) interacting locally with one another and with their environment, as illustrated in Figure 4-19.
A454310_1_En_4_Fig19_HTML.jpg
Figure 4-19.
Swarm intelligence interactions

Applications of Swarm Intelligence

Figure 4-20 shows some applications of swarm intelligence.
A454310_1_En_4_Fig20_HTML.jpg
Figure 4-20.
Applications of swarm intelligence

Ant-Based Routing

Swarm techniques are used for routing in systems such as telecommunication networks; this is called ant-based routing. The idea of ant-based routing is related to RL: small probe packets, which play the role of the ants, move forward and backward along the network paths and reinforce the good routes. This effectively floods the entire network with these probe packets.

Crowd Simulations

In movies, crowd scenes are simulated with the help of swarm optimization.

Human Swarming

The concept of human swarming is based on the collective use of many minds to predict an answer: different human brains work together to find a solution to a complex problem. Using collective brains in the form of human swarming produces more accurate results.

Swarm Grammars

Swarm grammars combine swarm behavior with sets of rules: different swarms work together according to those rules to produce varied results. The results can resemble art or architecture.

Swarmic Art

Combining the swarm behaviors of different species, such as birds and fish, can lead to swarmic art that shows the patterns in swarm behavior.
Before we cover swarm intelligence in more detail, we touch on the Rastrigin function. Swarm optimization algorithms are tested against different benchmark functions, one of which is the Rastrigin function, so you need to understand how it works.

The Rastrigin Function

In mathematical optimization problems, the Rastrigin function is a nonconvex function used as a performance test problem for optimization algorithms.
The formula is shown in Figure 4-21 and Figure 4-22 shows its typical output.
A454310_1_En_4_Fig21_HTML.jpg
Figure 4-21.
Depiction of the Rastrigin function
A454310_1_En_4_Fig22_HTML.jpg
Figure 4-22.
Rastrigin function output
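For reference, the standard form of the Rastrigin function for an n-dimensional input x, with A = 10, is:
f(\mathbf{x}) = A n + \sum_{i=1}^{n} \left[ x_i^2 - A \cos(2 \pi x_i) \right]
Its global minimum is f(\mathbf{0}) = 0. Note that the Python snippet later in this section adds A once rather than A times n; that only shifts the surface by a constant and does not change where the minima are.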
Let’s get started with using the Rastrigin function in Python .
You need to activate the Anaconda environment first:
abhi@ubuntu:∼$ source activate universe
(universe) abhi@ubuntu:∼$
Now switch to Python mode:
(universe) abhi@ubuntu:∼$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
The first time you import the plotting libraries, matplotlib builds its font cache if it does not already exist, as shown in Figure 4-23.
A454310_1_En_4_Fig23_HTML.jpg
Figure 4-23.
Cache being created
The entire flow of the Python program is as follows:
python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from matplotlib import cm
>>> from mpl_toolkits.mplot3d import Axes3D
/home/abhi/anaconda3/envs/universe/lib/python3.5/site-packages/matplotlib/font_manager.py:280: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  'Matplotlib is building the font cache using fc-list. '
>>> import math
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def rastrigin(*X, **kwargs):
...     A = kwargs.get('A', 10)
...     return A + sum([(x**2 - A * np.cos(2 * math.pi * x)) for x in X])
...
>>> X = np.linspace(-4, 4, 200)
>>> Y = np.linspace(-4, 4, 200)
>>> X, Y = np.meshgrid(X, Y)
>>> Z = rastrigin(X, Y, A=10)
>>> fig = plt.figure()
>>> ax = fig.gca(projection='3d')
>>> ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.plasma, linewidth=0, antialiased=False)
<mpl_toolkits.mplot3d.art3d.Poly3DCollection object at 0x7f79cfc73780>
>>> plt.savefig('rastrigin.png')
>>>
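For convenience, here is the same flow collected into a standalone script. This is a sketch: it assumes a non-interactive backend (so the plot is only saved, not displayed) and uses fig.add_subplot(111, projection='3d'), which newer Matplotlib versions prefer over fig.gca(projection='3d'):
import math

import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen and just save the PNG
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the '3d' projection on older versions

def rastrigin(*X, **kwargs):
    A = kwargs.get('A', 10)
    return A + sum([(x ** 2 - A * np.cos(2 * math.pi * x)) for x in X])

if __name__ == '__main__':
    X = np.linspace(-4, 4, 200)
    Y = np.linspace(-4, 4, 200)
    X, Y = np.meshgrid(X, Y)
    Z = rastrigin(X, Y, A=10)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                    cmap=cm.plasma, linewidth=0, antialiased=False)
    plt.savefig('rastrigin.png')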
If you go back to the folder, you can see that the rastrigin.png file was created, as shown in Figure 4-24.
A454310_1_En_4_Fig24_HTML.jpg
Figure 4-24.
Rastrigin function PNG file being saved
The rastrigin.png output from the problem shows the many local minima, as you can see in Figure 4-25; this is what makes the global optimum so difficult to find.
A454310_1_En_4_Fig25_HTML.jpg
Figure 4-25.
The Rastrigin function PNG file

Swarm Intelligence in Python

This section looks at a program in Python that works with the concept of swarm intelligence. You will therefore get to know particle swarm optimization (PSO) within Python. You can achieve this with the help of a research toolkit known as PySwarms.
PySwarms is a good tool for implementing optimization algorithms with the PSO method using different swarm topologies, such as:
  • Star topology
  • Ring topology
First, you need to install PySwarms. Open a terminal and activate the Anaconda environment using the following command.
abhi@ubuntu:∼$ source activate universe
(universe) abhi@ubuntu:∼$
The dependencies prior to installing PySwarms are as follows:
numpy >= 1.13.0
scipy >= 0.17.0
matplotlib >= 1.3.1
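If any of these are missing from your environment, one way to get them (assuming pip is available in the environment) is:
(universe) abhi@ubuntu:∼$ pip install "numpy>=1.13.0" "scipy>=0.17.0" "matplotlib>=1.3.1"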
Now install PySwarms as follows:
(universe) abhi@ubuntu:∼$ pip install pyswarms
When the process is complete, PySwarms is installed, as shown in Figure 4-26.
A454310_1_En_4_Fig26_HTML.jpg
Figure 4-26.
PySwarms are installed
Now we move to Python mode.
(universe) abhi@ubuntu:∼$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
First, you need to import the PySwarms utilities as follows:
>>> import pyswarms as ps
PySwarms provides several built-in objective functions that you can use; to access them, you have to import them:
>>> from pyswarms.utils.functions import single_obj as fx
Next, you need to declare these hyperparameters:
>>> options = {'c1': 0.5, 'c2': 0.3, 'w':0.9}
In this case, we configure the swarm's behavior with a dictionary of hyperparameters, so we pass the options in as a dictionary.
In the next step, you create an instance of the optimizer by passing this dictionary along with the other necessary arguments.
>>> optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2, options=options)
After that, call the optimize method and store the optimal cost and position it returns. The call is sketched next.
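This sketch mirrors the local-best call shown later in this section; the exact arguments used for the run are visible in the screenshot:
>>> cost, pos = optimizer.optimize(fx.sphere_func, print_step=50, iters=1000, verbose=3)
Figure 4-27 shows the results.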
A454310_1_En_4_Fig27_HTML.jpg
Figure 4-27.
Showing the result
Looking at the results, you can see that the optimizer was able to find a good minimum.
You will now do the same thing using the local-best PSO. You similarly declare a dictionary of options; this variant also needs k (the number of neighbors) and p (the distance norm):
>>> options = {'c1': 0.5, 'c2': 0.3, 'w':0.9, 'k': 2, 'p': 2}
Create the instance of the optimizer:
>>> optimizer = ps.single.LocalBestPSO(n_particles=10, dimensions=2, options=options)
Now you call the optimize method and store the values as you did before.
The verbose argument controls how much output is printed, and print_step sets how many iterations pass between progress messages.
>>> cost, pos = optimizer.optimize(fx.sphere_func, print_step=50, iters=1000, verbose=3)
The output is shown in Figure 4-28.
A454310_1_En_4_Fig28_HTML.jpg
Figure 4-28.
The output of the swarm optimization

Building a Game AI

We have already discussed game AI with OpenAI Gym and environment simulation, but we take it further in this section. First, we clone one of the most important and simplest examples of game AI, as shown in Figure 4-29.
A454310_1_En_4_Fig29_HTML.jpg
Figure 4-29.
Cloning the repo
You first need to set up the environment. The requirements are as follows:
  • TensorFlow
  • OpenAI Gym
  • virtualenv
  • TFLearn
There is one dependency to install first, the virtual environment. You install it using this command:
conda install -c anaconda virtualenv
It will ask you whether you want to install the new virtualenv package, as shown in Figure 4-30. Choose yes.
A454310_1_En_4_Fig30_HTML.jpg
Figure 4-30.
Getting the virtualenv package
When the package installation is successful and complete, you’ll see the screen in Figure 4-31.
A454310_1_En_4_Fig31_HTML.jpg
Figure 4-31.
Package installation is complete
Now you can install TFLearn using this command:
conda install -c derickl tflearn
When you attempt to install TFLearn, you may get an error because the package in that channel is not built for your platform:
conda install -c derickl tflearn
Fetching package metadata .........
Solving package specifications: .
PackageNotFoundError: Package not found: '' Package missing in current linux-64 channels:
  - tflearn
You can search for packages on anaconda.org with
    anaconda search -t conda tflearn
(universe) abhi@ubuntu:∼$ anaconda search -t conda tflearn
Using Anaconda API: https://api.anaconda.org
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
     Name                      |  Version | Package Types   | Platforms
     ------------------------- |   ------ | --------------- | ---------------
     asherp/tflearn            |    0.2.2 | conda           | osx-64
     contango/tflearn          |    0.3.2 | conda           | linux-64
     derickl/tflearn           |    0.2.2 | conda           | osx-64
Found 3 packages
If this happens, be sure to install the one that’s for linux-64:
(universe) abhi@ubuntu:∼$ anaconda show contango/tflearn
Using Anaconda API: https://api.anaconda.org
Name:    tflearn
Summary:
Access:  public
Package Types:  conda
Versions:
   + 0.3.2
To install this package with Anaconda, run the following command:
conda install --channel https://conda.anaconda.org/contango tflearn
It will ask for installation of other packages, as shown in Figure 4-32.
A454310_1_En_4_Fig32_HTML.jpg
Figure 4-32.
Installation of other packages
Now start Python and import the relevant libraries using these commands:
(universe) abhi@ubuntu:∼$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gym
>>> import random
>>> import numpy as np
>>> import tflearn
>>> from tflearn.layers.core import input_data, dropout, fully_connected
>>> from tflearn.layers.estimator import regression
>>> from statistics import median, mean
>>> from collections import Counter
>>> LR = 1e-3
>>> env = gym.make("CartPole-v0")
[2017-09-22 08:22:15,933] Making new env: CartPole-v0
>>> env.reset()
array([-0.03283849, -0.04877971,  0.0408221 , -0.01600674])

The Entire TFLearn Code

To start with, you need to import the important libraries. TFLearn provides fast prototyping on top of TensorFlow, so the program can implement RL very quickly.
We add a learning rate, initialize a simulated environment, and then sample the movement pattern with the following command:
action = env.action_space.sample()
This example pairs each observation with the movement of the balanced cart-pole (moving left or right). In the given problem, the score is the basis of the RL reward signal that we reference.
After applying the RL, we train the model with TFLearn, a module for TensorFlow that is used to create a fully connected neural network and speed up the training process.
import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter
LR = 1e-3
env = gym.make("CartPole-v0")
env.reset()
goal_steps = 500
score_requirement = 50
initial_games = 10000
def some_random_games_first():
    # Each of these is its own game.
    for episode in range(5):
        env.reset()
        # this is each frame, up to 200...but we wont make it that far.
        for t in range(200):
            # This will display the environment
            # Only display if you really want to see it.
            # Takes much longer to display it.
            env.render()
            # This will just create a sample action in any environment.
            # In this environment, the action can be 0 or 1, which is left or right
            action = env.action_space.sample()
            # this executes the environment with an action,
            # and returns the observation of the environment,
            # the reward, if the env is over, and other info.
            observation, reward, done, info = env.step(action)
            if done:
                break
some_random_games_first()
def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # just the scores that met our threshold:
    accepted_scores = []
    # iterate through however many games we want:
    for _ in range(initial_games):
        score = 0
        # moves specifically from this environment:
        game_memory = []
        # previous observation that we saw
        prev_observation = []
        # for each frame in 200
        for _ in range(goal_steps):
            # choose random action (0 or 1)
            action = random.randrange(0,2)
            # do it!
            observation, reward, done, info = env.step(action)
            # notice that the observation is returned FROM the action
            # so we'll store the previous observation here, pairing
            # the prev observation to the action we'll take.
            if len(prev_observation) > 0 :
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score+=reward
            if done: break
        # IF our score is higher than our threshold, we'd like to save
        # every move we made
        # NOTE the reinforcement methodology here.
        # all we're doing is reinforcing the score, we're not trying
        # to influence the machine in any way as to HOW that score is
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0,1]
                elif data[1] == 0:
                    output = [1,0]
                # saving our training data
                training_data.append([data[0], output])
        # reset env to play again
        env.reset()
        # save overall scores
        scores.append(score)
    # just in case you wanted to reference later
    training_data_save = np.array(training_data)
    np.save('saved.npy',training_data_save)
    # some stats here, to further illustrate the neural network magic!
    print('Average accepted score:',mean(accepted_scores))
    print('Median score for accepted scores:',median(accepted_scores))
    print(Counter(accepted_scores))
    return training_data
def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model
def train_model(training_data, model=False):
    X = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)
    y = [i[1] for i in training_data]
    if not model:
        model = neural_network_model(input_size = len(X[0]))
    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_learning')
    return model
training_data = initial_population()
model = train_model(training_data)
scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()
        if len(prev_obs)==0:
            action = random.randrange(0,2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])
        choices.append(action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score+=reward
        if done: break
    scores.append(score)
print('Average Score:',sum(scores)/len(scores))
print('choice 1:{}  choice 0:{}'.format(choices.count(1)/len(choices),choices.count(0)/len(choices)))
print(score_requirement)
Here is the output:
Average Score: 195.9
choice 1:0.5074017355793773  choice 0:0.49259826442062277
50
Solved.

Conclusion

This chapter touched on Q learning and then showed some examples. It also covered the MDP toolbox, swarm intelligence, and game AI, and ended with a full example. Chapter 5 covers Reinforcement Learning with Keras, TensorFlow, and ChainerRL.