A small example

Anaconda doesn't ship this package, so it has to be installed through pip:

pip install gym[atari]

We won't use the Atari part of Gym here, but it will be required later for the Breakout game.

From this package, we can create an environment for different games, like this:

import gym

env = gym.make('FrozenLake-v0')

This creates a new environment for the text game FrozenLake. Its map is made of four four-character strings: the player starts at S and must reach the goal G, walking over frozen cells (F). But there are holes (H) on the way to this goal, and falling into one makes you lose the game. The map, which we can also render directly as shown after the list, looks like this:

  • SFFF
  • FHFH
  • FFFH
  • HFFG
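
As a quick check (a minimal sketch assuming the env created above), env.render() prints this map to the console with the player's current cell highlighted:

env.reset()
# Display the current 4x4 map on stdout
env.render()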

From the environment, we can get the size of the observation space, env.observation_space.n, which is 16 here (one discrete state per cell, encoding where the player is located), and the size of the action space, env.action_space.n, which is 4 here (left, down, right and up).
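
A quick confirmation of both sizes (assuming the env created above):

print(env.observation_space.n)  # 16: one discrete state per cell
print(env.action_space.n)       # 4: left, down, right and up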

As this is a small toy example, we can keep a tabular estimate of Q(s, a), updated with the standard Q-learning rule:
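
Q(s, a) ← Q(s, a) + lr · (r + y · max_a' Q(s', a') − Q(s, a))

where s' is the state reached after taking action a in state s, r is the immediate reward, lr the learning rate and y the discount factor. This is exactly the update performed inside the loop below.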

# Inspired by https://github.com/tensorlayer/tensorlayer/
# blob/master/example/tutorial_frozenlake_q_table.py
import numpy as np

Q = np.zeros((env.observation_space.n, env.action_space.n))
# Set the learning hyperparameters
lr = .8       # learning rate
y = .95       # discount factor
num_episodes = 2000

# Let's run!
for i in range(num_episodes):
    # Reset the environment and get the first observation (top-left cell)
    s = env.reset()
    # Do at most 100 steps to update the table
    for j in range(100):
        # Choose an action by picking the max of the table
        # + additional random noise that decays with the episode number
        a = np.argmax(Q[s, :]
                      + np.random.randn(1, env.action_space.n) / (i + 1))
        # Get the new state and reward from the environment after that step
        s1, r, d, _ = env.step(a)
        # Update the Q-table with the new knowledge
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        s = s1
        # d is True when the episode is over (hole or goal)
        if d:
            break
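
As a quick sanity check (a minimal sketch, not part of the original recipe), we can follow the learned table greedily for one episode and see whether the goal is reached:

s = env.reset()
total_reward = 0
for _ in range(100):           # FrozenLake episodes stay short
    a = np.argmax(Q[s, :])     # always take the best-valued action
    s, r, d, _ = env.step(a)
    total_reward += r
    if d:
        break
print(total_reward)            # 1.0 if the goal was reached, 0.0 otherwise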

We can now display the contents of the table Q with print(Q):

[[0.18118924 0.18976168 0.19044738 0.18260069]
 [0.03811294 0.19398589 0.18619181 0.18624451]
 [0.16266812 0.13309552 0.14401865 0.11183018]
 [0.02533285 0.12890984 0.02641699 0.15121063]
 [0.20015578 0.00201834 0.00902377 0.03619787]
 [0.         0.         0.         0.        ]
 [0.1294778  0.04845176 0.03590482 0.13001683]
 [0.         0.         0.         0.        ]
 [0.02543623 0.05444387 0.01170018 0.19347353]
 [0.06137181 0.43637431 0.00372395 0.00830249]
 [0.25205174 0.00709722 0.00908675 0.00296389]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.15032826 0.43034276 0.09982157]
 [0.         0.86241133 0.         0.        ]
 [0.         0.         0.         0.        ]]

Some rows contain only zeros: these correspond to the holes and to the final goal cell, from which no further action is ever taken. For the other states, starting from the first cell, we can walk through this table step by step, choosing each next action with the probabilities given by the corresponding row (after normalization), as sketched below.
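
A minimal sketch of that sampling idea, assuming the trained table Q from above (the starting state 0 has a non-zero row, so the normalization is well defined):

s = 0                      # start in the top-left cell
row = Q[s, :]
probs = row / row.sum()    # turn the row into action probabilities
a = np.random.choice(env.action_space.n, p=probs)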

Of course, this is just a table, not a network, so let's use TensorFlow to make a network learn this table.
