A small example

Anaconda doesn't ship this package, so it has to be installed through pip:

pip install gym[atari]

We won't use the Atari part of Gym here, but it will be required later for the Breakout game.

From this package, we can create an environment for different games, like this:

import gym

env = gym.make('FrozenLake-v0')

This creates a new environment for the text game FrozenLake. Its map is made of four four-character strings: the player starts at S and must reach the goal G, walking over frozen cells (F). But there are holes (H) on the way to this goal, and falling into one makes you lose the game. The map, which we can also render directly as shown after the list, looks like this:

  • SFFF
  • FHFH
  • FFFH
  • HFFG
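
As a quick check (a minimal sketch assuming the env created above), env.render() prints this map to the console with the player's current cell highlighted:

env.reset()
# Display the current 4x4 map on stdout
env.render()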

From the environment, we can get the size of the observation space, env.observation_space.n, which is 16 here (one discrete state per cell, encoding where the player is located), and the size of the action space, env.action_space.n, which is 4 here (left, down, right and up).
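
A quick confirmation of both sizes (assuming the env created above):

print(env.observation_space.n)  # 16: one discrete state per cell
print(env.action_space.n)       # 4: left, down, right and up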

As this is a small toy example, we can keep a tabular estimate of Q(s, a), updated with the standard Q-learning rule:
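
Q(s, a) ← Q(s, a) + lr · (r + y · max_a' Q(s', a') − Q(s, a))

where s' is the state reached after taking action a in state s, r is the immediate reward, lr the learning rate and y the discount factor. This is exactly the update performed inside the loop below.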

# Inspired by https://github.com/tensorlayer/tensorlayer/
# blob/master/example/tutorial_frozenlake_q_table.py
import numpy as np

Q = np.zeros((env.observation_space.n, env.action_space.n))
# Set the learning hyperparameters
lr = .8       # learning rate
y = .95       # discount factor
num_episodes = 2000

# Let's run!
for i in range(num_episodes):
    # Reset the environment and get the first observation (top-left cell)
    s = env.reset()
    # Do at most 100 steps to update the table
    for j in range(100):
        # Choose an action by picking the max of the table
        # + additional random noise that decays with the episode number
        a = np.argmax(Q[s, :]
                      + np.random.randn(1, env.action_space.n) / (i + 1))
        # Get the new state and reward from the environment after that step
        s1, r, d, _ = env.step(a)
        # Update the Q-table with the new knowledge
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        s = s1
        # d is True when the episode is over (hole or goal)
        if d:
            break
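
As a quick sanity check (a minimal sketch, not part of the original recipe), we can follow the learned table greedily for one episode and see whether the goal is reached:

s = env.reset()
total_reward = 0
for _ in range(100):           # FrozenLake episodes stay short
    a = np.argmax(Q[s, :])     # always take the best-valued action
    s, r, d, _ = env.step(a)
    total_reward += r
    if d:
        break
print(total_reward)            # 1.0 if the goal was reached, 0.0 otherwise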

We can now display the contents of the table Q with print(Q):

[[0.18118924 0.18976168 0.19044738 0.18260069]
 [0.03811294 0.19398589 0.18619181 0.18624451]
 [0.16266812 0.13309552 0.14401865 0.11183018]
 [0.02533285 0.12890984 0.02641699 0.15121063]
 [0.20015578 0.00201834 0.00902377 0.03619787]
 [0.         0.         0.         0.        ]
 [0.1294778  0.04845176 0.03590482 0.13001683]
 [0.         0.         0.         0.        ]
 [0.02543623 0.05444387 0.01170018 0.19347353]
 [0.06137181 0.43637431 0.00372395 0.00830249]
 [0.25205174 0.00709722 0.00908675 0.00296389]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.15032826 0.43034276 0.09982157]
 [0.         0.86241133 0.         0.        ]
 [0.         0.         0.         0.        ]]

Some rows contain only zeros: these correspond to the holes and to the final goal cell, from which no further action is ever taken. For the other states, starting from the first cell, we can walk through this table step by step, choosing each next action with the probabilities given by the corresponding row (after normalization), as sketched below.
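
A minimal sketch of that sampling idea, assuming the trained table Q from above (the starting state 0 has a non-zero row, so the normalization is well defined):

s = 0                      # start in the top-left cell
row = Q[s, :]
probs = row / row.sum()    # turn the row into action probabilities
a = np.random.choice(env.action_space.n, p=probs)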

Of course, this is just a table, not a network, so let's use TensorFlow to make a network learn this table.
