
4. Applying Python to Reinforcement Learning

This chapter explores Reinforcement Learning in terms of Python. We start with Q learning in Python and then move on to a more in-depth analysis of Reinforcement Learning. Next we describe Swarm intelligence in Python, with an introduction to what exactly Swarm intelligence is. The chapter also covers the Markov decision process (MDP) toolbox.
Finally, you will implement a game AI and apply Reinforcement Learning to it. The chapter will be a good experience, so let's begin!

Q Learning with Python

Let’s start with a maze problem. The object of the game is to reach the yellow circle while avoiding the black squares. Figure 4-1 shows the maze. We use the numpy library in this example.
A454310_1_En_4_Fig1_HTML.jpg
Figure 4-1.
The maze that demonstrates Q learning
We have to choose an action based on the Q table, which is why we have a function called choose_action. When we want to move from one state to another, we apply the decision-making process in the choose_action method:
def choose_action(self, observation):
The learn function takes a transition consisting of the current state, the action taken, and the reward received, together with the next state:
def learn(self, s, a, r, s_):
The check_state_exist function checks whether a state is already in the Q table and appends it if it is not:
def check_state_exist(self, state):
These functions make up RL_brain, which is the basis of the project. The Q-learning update rules are applied by the run_this.py file, shown later in this chapter.
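For reference, the update that the learn function applies is the standard Q-learning rule, where \alpha is the learning rate lr and \gamma is the discount factor reward_decay used in the code:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]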

The Maze Environment Python File

The maze environment Python file, shown here, implements the environment for making moves. It declares the rewards as well as the ability to take the next step.
"""
Reinforcement learning maze example.
Red rectangle:          explorer.
Black rectangles:       hells       [reward = -1].
Yellow bin circle:      paradise    [reward = +1].
All other states:       ground      [reward = 0].
This script is the environment part of this example. The RL is in RL_brain.py.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
"""
import numpy as np
import time
import sys
if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk
UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width
class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_H * UNIT, MAZE_H * UNIT))
        self._build_maze()
    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                           height=MAZE_H * UNIT,
                           width=MAZE_W * UNIT)
        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_H * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)
        # create origin
        origin = np.array([20, 20])
        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')
        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')
        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')
        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # pack all
        self.canvas.pack()
    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # return observation
        return self.canvas.coords(self.rect)
    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:   # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:   # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:   # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:   # left
            if s[0] > UNIT:
                base_action[0] -= UNIT
        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent
        s_ = self.canvas.coords(self.rect)  # next state
        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False
        return s_, reward, done
    def render(self):
        time.sleep(0.1)
        self.update()
def update():
    for t in range(10):
        s = env.reset()
        while True:
            env.render()
            a = 1
            s, r, done = env.step(a)
            if done:
                break
if __name__ == '__main__':
    env = Maze()
    env.after(100, update)
    env.mainloop()

The RL_Brain Python File

Now for the RL_brain Python file. We define the Q-learning table that is built up as the agent moves from one state to another. The QLearningTable class structures the way the entire maze is learned. We also declare the hyperparameters that control how fast the program learns in the next chunk of code:
import numpy as np
import pandas as pd
class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions)
    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.uniform() < self.epsilon:
            # choose best action
            state_action = self.q_table.loc[observation, :]
            state_action = state_action.reindex(np.random.permutation(state_action.index))     # shuffle so ties between equal values are broken randomly
            action = state_action.idxmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action
    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append a new row of zeros (one value per action) for this state
            self.q_table.loc[state] = [0.0] * len(self.actions)
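As a quick sanity check (the state labels here are hypothetical and not part of the project files), you can exercise the class on its own:
from RL_brain import QLearningTable

RL = QLearningTable(actions=list(range(4)))
a = RL.choose_action('state_A')       # 'state_A' is added to the table the first time it is seen
RL.learn('state_A', a, 0, 'state_B')  # nudges Q('state_A', a) toward r + gamma * max Q('state_B', .)
print(RL.q_table)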

Updating the Function

This code segment declares the update function that drives the agent through the maze from one state to another. On each transition it receives the reward and lets the agent learn from it.
from maze_env import Maze
from RL_brain import QLearningTable
def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        while True:
            # fresh env
            env.render()
            # RL choose action based on observation
            action = RL.choose_action(str(observation))
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL learn from this transition
            RL.learn(str(observation), action, reward, str(observation_))
            # swap observation
            observation = observation_
            # break while loop when end of this episode
            if done:
                break
    # end of game
    print('game over')
    env.destroy()
if __name__ == "__main__":
    env = Maze()
    RL = QLearningTable(actions=list(range(env.n_actions)))
    env.after(100, update)
    env.mainloop()
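If you want to see what the agent actually learned, one small optional addition (not part of the original script) is to print the Q table at the end of update(), just before env.destroy():
    print(RL.q_table)   # optional: inspect the learned state-action values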
If you go into the project folder and run the run_this.py file, you get the output shown in Figure 4-2.
A454310_1_En_4_Fig2_HTML.jpg
Figure 4-2.
Running the file
Figure 4-3 shows the code running.
A454310_1_En_4_Fig3_HTML.jpg
Figure 4-3.
The maze file being run

Using the MDP Toolbox in Python

The MDP toolbox provides classes and functions for the resolution of discrete time Markov decision processes. The list of algorithms that have been implemented includes backwards induction, linear programming, policy iteration, Q learning, and value iteration along with several variations.
The following are the features of the MDP toolbox (see Figure 4-4):
  • Eight MDP algorithms
  • Fast array manipulation using NumPy
  • Full sparse matrix support using Scipy’s sparse package
  • Optional linear programming support using cvxopt
A454310_1_En_4_Fig4_HTML.jpg
Figure 4-4.
MDP toolbox features
Next, you see how to install and configure the MDP toolbox for Python. First, activate the Anaconda environment, as shown in Figure 4-5.
A454310_1_En_4_Fig5_HTML.jpg
Figure 4-5.
Activating the Anaconda environment
Now install the dependencies using this command (see Figure 4-6):
sudo apt-get install python3-numpy python3-scipy liblapack-dev libatlas-base-dev libgsl0-dev fftw-dev libglpk-dev libdsdp-dev
A454310_1_En_4_Fig6_HTML.jpg
Figure 4-6.
Installing the dependencies
When it asks you if it should install the dependencies, choose yes, as shown in Figure 4-7.
A454310_1_En_4_Fig7_HTML.jpg
Figure 4-7.
Choose yes to proceed
All the dependencies are then installed, as shown in Figure 4-8.
A454310_1_En_4_Fig8_HTML.jpg
Figure 4-8.
The dependencies are installed
Now you can go ahead and install the MDP toolbox, as shown in Figure 4-9.
A454310_1_En_4_Fig9_HTML.jpg
Figure 4-9.
Installing the MDP toolbox
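The figure shows the command being run; if you prefer to install from PyPI instead (the package is published there as pymdptoolbox), the equivalent is:
(universe) abhi@ubuntu:∼$ pip install pymdptoolbox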
The important packages are being installed, as shown in Figure 4-10.
A454310_1_En_4_Fig10_HTML.jpg
Figure 4-10.
Installing the important packages
If everything works as expected, you’ll get all the packages installed, as shown in Figure 4-11.
A454310_1_En_4_Fig11_HTML.jpg
Figure 4-11.
All the packages have been installed
Now you need to clone the repo from GitHub (see Figure 4-12):
git clone https://github.com/sawcordwell/pymdptoolbox.git
A454310_1_En_4_Fig12_HTML.jpg
Figure 4-12.
Cloning the repo
Switch to the mdptoolbox folder to see the details shown in Figure 4-13.
A454310_1_En_4_Fig13_HTML.jpg
Figure 4-13.
Getting inside the folder
You now need to switch to Python mode, as shown in Figure 4-14.
A454310_1_En_4_Fig14_HTML.jpg
Figure 4-14.
Inside Python mode
We will now use an example to see how the MDP toolbox works. First, import the MDP example, as shown in Figure 4-15.
A454310_1_En_4_Fig15_HTML.jpg
Figure 4-15.
Importing the modules
A Markov problem assumes that future states depend only on the current state, not on the events that occurred before it. We will set up an example Markov problem using a discount factor of 0.8. To use the built-in examples in the MDP toolbox, you import mdptoolbox.example, solve the problem with the value iteration algorithm, and then check the optimal policy. The optimal policy is the mapping from states to actions that yields the maximum expected reward.
You can check the policy with the vi.policy command, as shown in Figure 4-16.
A454310_1_En_4_Fig16_HTML.jpg
Figure 4-16.
Doing operations
The output for the policy is (0, 0, 0), meaning the optimal action in each of the three states is action 0. The expected discounted reward of each state under this policy can also be read from the solver, as shown after the listing.
Here is the full program:
import mdptoolbox.example
P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.8)
vi.run()
vi.policy # result is (0, 0, 0)
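The expected discounted reward of each state under this policy is also available on the solver object:
vi.V   # a tuple with one expected discounted reward per state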
Let's consider another example. First you need to import the toolbox and the toolbox example. By importing mdptoolbox.example, you bring in the built-in example problems that ship with the MDP toolbox (see Figure 4-17).
import mdptoolbox, mdptoolbox.example
A454310_1_En_4_Fig17_HTML.jpg
Figure 4-17.
Another example of MDP
This example can also be run in verbose mode, which displays the current stage and the policy transpose as the finite-horizon solver runs.
>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
>>> fh.run()
>>> fh.V
array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
       [ 5.9373,  3.24  ,  1.    ,  0.    ],
       [ 9.9373,  7.24  ,  4.    ,  0.    ]])
>>> fh.policy
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])
The next example also runs in verbose mode; each iteration displays the number of actions that differ between policy n-1 and policy n (see Figure 4-18).
A454310_1_En_4_Fig18_HTML.jpg
Figure 4-18.
Policy between n-1 and n
We again get help from the built-in examples of the MDP toolbox, this time solving discounted MDPs with policy iteration. Some of the transition and reward values are randomly generated using rand(10, 3), and the others come from the forest example used earlier.
We try to solve each MDP by applying policy iteration in this example:
>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.rand(10, 3)
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> P, R = mdptoolbox.example.forest()
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> expected = (26.244000000000014, 29.484000000000016, 33.484000000000016)
>>> all(expected[k] - pi.V[k] < 1e-12 for k in range(len(expected)))
    True
>>> pi.policy
(0, 0, 0)
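Because this chapter focuses on Q learning, it is worth noting that the toolbox also includes a tabular QLearning solver. A minimal sketch on the same forest example follows (the run is stochastic, so the seed is set only to make it repeatable; your learned policy may still differ):
>>> import numpy as np
>>> import mdptoolbox, mdptoolbox.example
>>> np.random.seed(0)
>>> P, R = mdptoolbox.example.forest()
>>> ql = mdptoolbox.mdp.QLearning(P, R, 0.96)
>>> ql.run()
>>> ql.policy
You can also inspect the learned Q table via ql.Q.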

Understanding Swarm Intelligence

Swarm intelligence is an important part of AI. It is the collective behavior of a decentralized, self-organized system, whether it be natural or artificial.
Swarm intelligence typically consists of a population of simple agents or boids (artificial life programs) interacting locally with one another and with their environment, as illustrated in Figure 4-19.
A454310_1_En_4_Fig19_HTML.jpg
Figure 4-19.
Swarm intelligence interactions

Applications of Swarm Intelligence

Figure 4-20 shows some applications of swarm intelligence.
A454310_1_En_4_Fig20_HTML.jpg
Figure 4-20.
Applications of swarm intelligence

Ant-Based Routing

Swarm techniques are used for routing in systems such as telecommunication networks; this is called ant-based routing. The idea of ant-based routing is related to RL: small probe packets, which play the role of the ants, move forward and backward along the network paths and reinforce the good routes. This effectively floods the entire network with these probe packets.

Crowd Simulations

In movies, crowd scenes are simulated with the help of swarm optimization.

Human Swarming

The concept of human swarming is based on the collective use of many minds to predict an answer: different human brains work together to find a solution to a complex problem. Using collective brains in the form of human swarming produces more accurate results.

Swarm Grammars

Swarm grammars combine swarm behavior with sets of rules: different swarms work together according to those rules to produce varied results. The results can resemble art or architecture.

Swarmic Art

Combining the swarm behaviors of different species, such as birds and fish, can lead to swarmic art that shows the patterns in swarm behavior.
Before we cover swarm intelligence in more detail, we touch on the Rastrigin function. Swarm optimization algorithms are tested against different benchmark functions, one of which is the Rastrigin function, so you need to understand how it works.

The Rastrigin Function

In mathematical optimization problems, the Rastrigin function is a nonconvex function used as a performance test problem for optimization algorithms.
The formula is shown in Figure 4-21 and Figure 4-22 shows its typical output.
A454310_1_En_4_Fig21_HTML.jpg
Figure 4-21.
Depiction of the Rastrigin function
A454310_1_En_4_Fig22_HTML.jpg
Figure 4-22.
Rastrigin function output
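For reference, the standard form of the Rastrigin function for an n-dimensional input x, with A = 10, is:
f(\mathbf{x}) = A n + \sum_{i=1}^{n} \left[ x_i^2 - A \cos(2 \pi x_i) \right]
Its global minimum is f(\mathbf{0}) = 0. Note that the Python snippet later in this section adds A once rather than A times n; that only shifts the surface by a constant and does not change where the minima are.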
Let’s get started with using the Rastrigin function in Python .
You need to activate the Anaconda environment first:
abhi@ubuntu:∼$ source activate universe
(universe) abhi@ubuntu:∼$
Now switch to Python mode:
(universe) abhi@ubuntu:∼$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
The first time you import the plotting libraries, matplotlib builds its font cache if it does not already exist, as shown in Figure 4-23.
A454310_1_En_4_Fig23_HTML.jpg
Figure 4-23.
Cache being created
The entire flow of the Python program is as follows:
python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from matplotlib import cm
>>> from mpl_toolkits.mplot3d import Axes3D
/home/abhi/anaconda3/envs/universe/lib/python3.5/site-packages/matplotlib/font_manager.py:280: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  'Matplotlib is building the font cache using fc-list. '
>>> import math
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def rastrigin(*X, **kwargs):
...     A = kwargs.get('A', 10)
...     return A + sum([(x**2 - A * np.cos(2 * math.pi * x)) for x in X])
...
>>> X = np.linspace(-4, 4, 200)
>>> Y = np.linspace(-4, 4, 200)
>>> X, Y = np.meshgrid(X, Y)
>>> Z = rastrigin(X, Y, A=10)
>>> fig = plt.figure()
>>> ax = fig.gca(projection='3d')
>>> ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.plasma, linewidth=0, antialiased=False)
<mpl_toolkits.mplot3d.art3d.Poly3DCollection object at 0x7f79cfc73780>
>>> plt.savefig('rastrigin.png')
>>>
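For convenience, here is the same flow collected into a standalone script. This is a sketch: it assumes a non-interactive backend (so the plot is only saved, not displayed) and uses fig.add_subplot(111, projection='3d'), which newer Matplotlib versions prefer over fig.gca(projection='3d'):
import math

import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen and just save the PNG
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the '3d' projection on older versions

def rastrigin(*X, **kwargs):
    A = kwargs.get('A', 10)
    return A + sum([(x ** 2 - A * np.cos(2 * math.pi * x)) for x in X])

if __name__ == '__main__':
    X = np.linspace(-4, 4, 200)
    Y = np.linspace(-4, 4, 200)
    X, Y = np.meshgrid(X, Y)
    Z = rastrigin(X, Y, A=10)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                    cmap=cm.plasma, linewidth=0, antialiased=False)
    plt.savefig('rastrigin.png')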
If you go back to the folder, you can see that the rastrigin.png file was created, as shown in Figure 4-24.
A454310_1_En_4_Fig24_HTML.jpg
Figure 4-24.
Rastrigin function PNG file being saved
The rastrigin.png output from the problem shows the many local minima, as you can see in Figure 4-25; this is what makes the global optimum so difficult to find.
A454310_1_En_4_Fig25_HTML.jpg
Figure 4-25.
The Rastrigin function PNG file

Swarm Intelligence in Python

This section looks at a program in Python that works with the concept of swarm intelligence. You will therefore get to know particle swarm optimization (PSO) within Python. You can achieve this with the help of a research toolkit known as PySwarms.
PySwarms is a good tool for implementing optimization algorithms with the PSO method using different swarm topologies, such as:
  • Star topology
  • Ring topology
First, you need to install PySwarms. Open a terminal and activate the Anaconda environment using the following command.
abhi@ubuntu:∼$ source activate universe
(universe) abhi@ubuntu:∼$
The dependencies prior to installing PySwarms are as follows:
numpy >= 1.13.0
scipy >= 0.17.0
matplotlib >= 1.3.1
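If any of these are missing from your environment, one way to get them (assuming pip is available in the environment) is:
(universe) abhi@ubuntu:∼$ pip install "numpy>=1.13.0" "scipy>=0.17.0" "matplotlib>=1.3.1"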
Now install PySwarms as follows:
(universe) abhi@ubuntu:∼$ pip install pyswarms
When the process is complete, PySwarms is installed, as shown in Figure 4-26.
A454310_1_En_4_Fig26_HTML.jpg
Figure 4-26.
PySwarms are installed
Now we move to Python mode.
(universe) abhi@ubuntu:∼$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
First, you need to import the PySwarms utilities as follows:
>>> import pyswarms as ps
PySwarms provides several built-in objective functions that you can use; to access them, you have to import them:
>>> from pyswarms.utils.functions import single_obj as fx
Next, you need to declare these hyperparameters:
>>> options = {'c1': 0.5, 'c2': 0.3, 'w':0.9}
In this case, we configure the swarm's behavior with a dictionary of hyperparameters, so we pass the options in as a dictionary.
In the next step, you create an instance of the optimizer by passing this dictionary along with the other necessary arguments.
>>> optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2, options=options)
After that, call the optimize method and store the optimal cost and position it returns. The call is sketched next.
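This sketch mirrors the local-best call shown later in this section; the exact arguments used for the run are visible in the screenshot:
>>> cost, pos = optimizer.optimize(fx.sphere_func, print_step=50, iters=1000, verbose=3)
Figure 4-27 shows the results.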
A454310_1_En_4_Fig27_HTML.jpg
Figure 4-27.
Showing the result
Looking at the results, you can see that the optimizer was able to find a good minimum.
You will now do the same thing using the local-best PSO. You similarly declare a dictionary of options; this variant also needs k (the number of neighbors) and p (the distance norm):
>>> options = {'c1': 0.5, 'c2': 0.3, 'w':0.9, 'k': 2, 'p': 2}
Create the instance of the optimizer:
>>> optimizer = ps.single.LocalBestPSO(n_particles=10, dimensions=2, options=options)
Now you call the optimize method and store the values as you did before.
The verbose argument controls how much output is printed, and print_step sets how many iterations pass between progress messages.
>>> cost, pos = optimizer.optimize(fx.sphere_func, print_step=50, iters=1000, verbose=3)
The output is shown in Figure 4-28.
A454310_1_En_4_Fig28_HTML.jpg
Figure 4-28.
The output of the swarm optimization

Building a Game AI

We have already discussed game AI with OpenAI Gym and environment simulation, but we take it further in this section. First, we clone one of the most important and simplest examples of game AI, as shown in Figure 4-29.
A454310_1_En_4_Fig29_HTML.jpg
Figure 4-29.
Cloning the repo
You first need to set up the environment. The requirements are as follows:
  • TensorFlow
  • OpenAI Gym
  • virtualenv
  • TFLearn
There is one dependency to install first, the virtual environment. You install it using this command:
conda install -c anaconda virtualenv
It will ask you whether you want to install the new virtualenv package, as shown in Figure 4-30. Choose yes.
A454310_1_En_4_Fig30_HTML.jpg
Figure 4-30.
Getting the virtualenv package
When the package installation is successful and complete, you’ll see the screen in Figure 4-31.
A454310_1_En_4_Fig31_HTML.jpg
Figure 4-31.
Package installation is complete
Now you can install TFLearn using this command:
conda install -c derickl tflearn
When you attempt to install TFLearn, you may get an error because the package in that channel is not built for your platform:
conda install -c derickl tflearn
Fetching package metadata .........
Solving package specifications: .
PackageNotFoundError: Package not found: '' Package missing in current linux-64 channels:
  - tflearn
You can search for packages on anaconda.org with
    anaconda search -t conda tflearn
(universe) abhi@ubuntu:∼$ anaconda search -t conda tflearn
Using Anaconda API: https://api.anaconda.org
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
     Name                      |  Version | Package Types   | Platforms
     ------------------------- |   ------ | --------------- | ---------------
     asherp/tflearn            |    0.2.2 | conda           | osx-64
     contango/tflearn          |    0.3.2 | conda           | linux-64
     derickl/tflearn           |    0.2.2 | conda           | osx-64
Found 3 packages
If this happens, be sure to install the one that’s for linux-64:
(universe) abhi@ubuntu:∼$ anaconda show contango/tflearn
Using Anaconda API: https://api.anaconda.org
Name:    tflearn
Summary:
Access:  public
Package Types:  conda
Versions:
   + 0.3.2
To install this package with Anaconda, run the following command:
conda install --channel https://conda.anaconda.org/contango tflearn
It will ask for installation of other packages, as shown in Figure 4-32.
A454310_1_En_4_Fig32_HTML.jpg
Figure 4-32.
Installation of other packages
Now start Python and import the relevant libraries using these commands:
(universe) abhi@ubuntu:∼$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gym
>>> import random
>>> import numpy as np
>>> import tflearn
>>> from tflearn.layers.core import input_data, dropout, fully_connected
>>> from tflearn.layers.estimator import regression
>>> from statistics import median, mean
>>> from collections import Counter
>>> LR = 1e-3
>>> env = gym.make("CartPole-v0")
[2017-09-22 08:22:15,933] Making new env: CartPole-v0
>>> env.reset()
array([-0.03283849, -0.04877971,  0.0408221 , -0.01600674])

The Entire TFLearn Code

To start with, you need to import the important libraries. TFLearn provides fast prototyping on top of TensorFlow, so the program can implement RL very quickly.
We add a learning rate, initialize a simulated environment, and then sample the movement pattern with the following command:
action = env.action_space.sample()
This example pairs each observation with the movement of the balanced cart-pole (moving left or right). In the given problem, the score is the basis of the RL reward signal that we reference.
After applying the RL, we train the model with TFLearn, a module for TensorFlow that is used to create a fully connected neural network and speed up the training process.
import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter
LR = 1e-3
env = gym.make("CartPole-v0")
env.reset()
goal_steps = 500
score_requirement = 50
initial_games = 10000
def some_random_games_first():
    # Each of these is its own game.
    for episode in range(5):
        env.reset()
        # this is each frame, up to 200...but we wont make it that far.
        for t in range(200):
            # This will display the environment
            # Only display if you really want to see it.
            # Takes much longer to display it.
            env.render()
            # This will just create a sample action in any environment.
            # In this environment, the action can be 0 or 1, which is left or right
            action = env.action_space.sample()
            # this executes the environment with an action,
            # and returns the observation of the environment,
            # the reward, if the env is over, and other info.
            observation, reward, done, info = env.step(action)
            if done:
                break
some_random_games_first()
def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # just the scores that met our threshold:
    accepted_scores = []
    # iterate through however many games we want:
    for _ in range(initial_games):
        score = 0
        # moves specifically from this environment:
        game_memory = []
        # previous observation that we saw
        prev_observation = []
        # for each frame in 200
        for _ in range(goal_steps):
            # choose random action (0 or 1)
            action = random.randrange(0,2)
            # do it!
            observation, reward, done, info = env.step(action)
            # notice that the observation is returned FROM the action
            # so we'll store the previous observation here, pairing
            # the prev observation to the action we'll take.
            if len(prev_observation) > 0 :
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score+=reward
            if done: break
        # IF our score is higher than our threshold, we'd like to save
        # every move we made
        # NOTE the reinforcement methodology here.
        # all we're doing is reinforcing the score, we're not trying
        # to influence the machine in any way as to HOW that score is
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0,1]
                elif data[1] == 0:
                    output = [1,0]
                # saving our training data
                training_data.append([data[0], output])
        # reset env to play again
        env.reset()
        # save overall scores
        scores.append(score)
    # just in case you wanted to reference later
    training_data_save = np.array(training_data)
    np.save('saved.npy',training_data_save)
    # some stats here, to further illustrate the neural network magic!
    print('Average accepted score:',mean(accepted_scores))
    print('Median score for accepted scores:',median(accepted_scores))
    print(Counter(accepted_scores))
    return training_data
def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model
def train_model(training_data, model=False):
    X = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)
    y = [i[1] for i in training_data]
    if not model:
        model = neural_network_model(input_size = len(X[0]))
    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_learning')
    return model
training_data = initial_population()
model = train_model(training_data)
scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()
        if len(prev_obs)==0:
            action = random.randrange(0,2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])
        choices.append(action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score+=reward
        if done: break
    scores.append(score)
print('Average Score:',sum(scores)/len(scores))
print('choice 1:{}  choice 0:{}'.format(choices.count(1)/len(choices),choices.count(0)/len(choices)))
print(score_requirement)
Here is the output:
Average Score: 195.9
choice 1:0.5074017355793773  choice 0:0.49259826442062277
50
Solved.

Conclusion

This chapter touched on Q learning and then showed some examples. It also covered the MDP toolbox, swarm intelligence, and game AI, and ended with a full example. Chapter 5 covers Reinforcement Learning with Keras, TensorFlow, and ChainerRL.