Using TensorFlow for the text game

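The code in this section assumes that the imports, the game environment, and a few hyperparameters are already defined, as in the Q-table version. The following is a minimal setup sketch; the environment name and the exact hyperparameter values are assumptions for illustration, not fixed choices:

import gym
import numpy as np
import tensorflow as tf

# Assumed environment: the 4x4 text game with 16 states and 4 actions
env = gym.make("FrozenLake-v0")

learning_rate = 0.1   # step size for the gradient descent optimizer (assumed)
y = 0.99              # discount factor for future rewards (assumed)
e = 0.1               # initial probability of picking a random action (assumed)
num_episodes = 2000   # number of games played during training (assumed)
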
Let's think about the type of architecture we need here. The input is the state of the game, and the output should be one of four possible actions. The game is simple enough that there is an optimal strategy, a unique path from the start to the goal, so the network can be very simple: a single dense layer with a linear output:

inputs = tf.placeholder(shape=[None, 16], dtype=tf.float32, name="input")
# A single dense layer maps the one-hot state to one Q value per action
Qout = tf.layers.dense(
    inputs=inputs,
    units=4,
    use_bias=False,
    name="dense",
    kernel_initializer=tf.random_uniform_initializer(minval=0, maxval=.0125)
)
predict = tf.argmax(Qout, 1)

# Our optimizer will try to minimize the squared difference
# between the target Q values and the predicted ones
nextQ = tf.placeholder(shape=[None, 4], dtype=tf.float32, name="target")
loss = tf.reduce_sum(tf.square(nextQ - Qout))

trainer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
updateModel = trainer.minimize(loss)
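
Note that the network never sees the state index directly: each of the 16 states is fed in as a one-hot row vector, which is what the np.identity(16)[s:s+1] expressions in the training loop below produce. A quick illustration, using an arbitrary state index:

import numpy as np

s = 5                                  # arbitrary state index for illustration
state_input = np.identity(16)[s:s+1]   # shape (1, 16): a batch of one one-hot vector
print(state_input[0, 5])               # 1.0, every other entry is 0.0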

For training, we need to reintroduce exploration, like the randomness we had in our Q-table before. To accomplish this, roughly one prediction in ten is replaced by a random action (this is called an epsilon-greedy strategy, and we will reuse a variation of it later with the Atari games). In either case, we then compute the target Q value for the chosen action, the immediate reward plus the discounted maximum Q value of the next state, and we train our network to match this target (updating the dense layer's weights):

# To keep track of our games and our results
jList = []  # number of steps survived in each episode
rList = []  # total reward obtained in each episode

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for i in range(num_episodes):
        s = env.reset()
        rAll = 0

        for j in range(100):
            a, targetQ = sess.run([predict, Qout],
                feed_dict={inputs: np.identity(16)[s:s+1]})
            # With probability e, replace the greedy action with a
            # random one so we keep exploring new states
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()

            s1, r, d, _ = env.step(a[0])

            # Obtain the Q' values by feeding
            # the new state through our network
            Q1 = sess.run(Qout,
                feed_dict={inputs: np.identity(16)[s1:s1+1]})
            # Obtain maxQ' and set our target value for the chosen action
            targetQ[0, a[0]] = r + y * np.max(Q1)

            # Train our network using the target and predicted Q values
            sess.run(updateModel,
                feed_dict={inputs: np.identity(16)[s:s+1], nextQ: targetQ})
            rAll += r
            s = s1
            if d:
                # Reduce the chance of a random action as we train the model
                e = 1 / ((i // 50) + 10)
                break

        jList.append(j)
        rList.append(rAll)
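
The plots below can be reproduced from the rList and jList values recorded above. Here is a minimal sketch using matplotlib and a 20-episode moving average; the plotting code itself is one possible way to do it, not part of the original example:

import matplotlib.pyplot as plt
import numpy as np

def moving_average(values, window=20):
    # Average each point over a sliding window of `window` episodes
    return np.convolve(values, np.ones(window) / window, mode="valid")

plt.figure()
plt.plot(moving_average(rList))
plt.xlabel("Episode")
plt.ylabel("Reward (20-episode moving average)")

plt.figure()
plt.plot(moving_average(jList))
plt.xlabel("Episode")
plt.ylabel("Steps per episode (20-episode moving average)")
plt.show()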

With this strategy, the network achieves a success rate of around 40%, but this single number is misleading because it averages over the whole training run. If we plot the evolution of the reward (averaged over a sliding window of 20 episodes), we can see that the network improves drastically over time:

And the same happens with the alive time (the number of steps per episode):

We can see that as the network started to collect better rewards, it also managed to keep the player alive longer. Unfortunately, the network is still far from optimal at this task: a human would need only eight steps to finish the game.

We can now use a similar strategy for the Atari games.
