Developing a Multiarmed Bandit's Predictive Model

One of the simplest RL problems is the n-armed bandit problem: there are n slot machines, each with a different, fixed payout probability. The goal is to maximize profit by always choosing the machine with the best payout.

As mentioned earlier, we will also see how to use a policy gradient method that produces explicit outputs. For our multiarmed bandit, we don't need to condition these outputs on any particular state. To keep things simple, we can design the network so that it consists of just a set of weights, one for each arm that can be pulled, where each weight represents how profitable the agent thinks pulling that arm is. A naive approach is to initialize these weights to 1 so that the agent is optimistic about each arm's potential reward.

To update the network, we choose arms with the e-greedy policy discussed earlier: most of the time the agent picks the action with the largest expected value, but with a small probability it picks an action at random so that it keeps exploring. Once the agent has taken an action, it receives a reward of either 1 or -1. This is, of course, a simplified setting, but it is enough to illustrate the idea; a minimal sketch of the selection rule follows.
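
Here is a minimal sketch of that e-greedy selection rule in plain NumPy (the names epsilon and estimated_values are hypothetical and used only for illustration; the real code appears later in this section):

    import numpy as np

    epsilon = 0.1                  # probability of exploring
    estimated_values = np.ones(4)  # optimistic initial estimates, one per arm
    if np.random.rand() < epsilon:
        # Explore: pick a random arm.
        action = np.random.randint(len(estimated_values))
    else:
        # Exploit: pick the arm with the largest estimated value.
        action = np.argmax(estimated_values)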

We will develop a simple but effective bandit agent incrementally. At first, there will be no state at all, that is, we will have a stateless agent. We will then see that such a stateless bandit agent is too limited to solve more complex, real-life problems.

Then we will increase the agent's complexity by converting the simple bandits into contextual bandits. The contextual bandit agent is stateful, and so can solve our prediction problem more effectively. Finally, we will further increase the agent's complexity by converting the contextual bandits into a full RL agent before deploying it:

  1. Loading the required libraries.

    Load the libraries, packages, and modules needed:

    import tensorflow as tf
    import tensorflow.contrib.slim as slim
    import numpy as np
  2. Defining bandits.

    For this example, I am using a four-armed bandit. The getBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit's value, the more likely a positive reward will be returned. As stated earlier, this is a naive but greedy way to train the agent so that it learns to choose the bandit that generates not only a positive but also the maximum reward. The list of bandits is defined right after the function and is arranged so that Bandit 4 most often provides a positive reward:

    def getBandit(bandit):
        '''
        This function creates the reward for a bandit on the basis of a randomly generated number. It returns either a positive or a negative reward.
        '''
        random_number = np.random.randn(1)
        if random_number > bandit:
            return 1
        else:
            return -1
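
    The bandit list itself does not appear in the listing above, but the rest of the code relies on bandits and num_bandits being defined. A minimal definition consistent with the description (the exact values are an assumption; what matters is that Bandit 4 has the lowest value and therefore pays out most often) is:

    # Assumed bandit values: the lower the value, the more likely getBandit() returns +1,
    # so Bandit 4 (value -5) most often provides a positive reward.
    bandits = [0.2, 0, -0.2, -5]
    num_bandits = len(bandits)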
  3. Developing an agent for the bandits.

    The following code creates a very simple neural agent consisting of a set of values, one for each of the bandits. Each value is an estimate of the return from choosing that bandit, initialized to 1. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward. At first, we need to reset the graph as follows:

    tf.reset_default_graph()

    Then, the next two lines establish the feed-forward part of the network, which does the actual choosing:

    weight_op = tf.Variable(tf.ones([num_bandits]))
    action_op = tf.argmax(weight_op,0)

    Now we need to set up the training procedure. We feed the received reward and the chosen action into the network to compute the loss, and use the loss to update the network:

    reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
    action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
    responsible_weight = tf.slice(weight_op,action_holder,[1])

    Next, we define the objective function, that is, the loss:

    loss = -(tf.log(responsible_weight)*reward_holder)
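
    To see why this loss moves the weights in the right direction, note that for the chosen arm the loss is -log(w) * r, so its gradient with respect to w is -r/w. A gradient descent step, w <- w - LR * (-r/w) = w + LR * r/w, therefore increases the weight when the reward is +1 and decreases it when the reward is -1.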

    Then, we set a small learning rate so that the training process proceeds slowly and steadily:

    LR = 0.001

    We then use the gradient descent optimizer and instantiate the training operation:

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=LR)
    training_op = optimizer.minimize(loss)

    Now, it's time to define the training parameters: the total number of episodes used to train the agent, a scoreboard of total rewards for the bandits (initialized to 0), and the probability of taking a random, exploratory action:

    total_episodes = 10000
    total_reward = np.zeros(num_bandits) 
    chance_of_random_action = 0.1 

    Finally, we initialize the global variables:

    init_op = tf.global_variables_initializer() 
  4. Training the agent.

    We train the agent by taking actions in the environment and receiving rewards. We start by creating a TensorFlow session and launching the graph, then iterate up to the total number of episodes. In each episode, we choose either a random action or one from the network, compute the reward for pulling the chosen bandit's arm, update the network, and finally update the scoreboard:

    with tf.Session() as sess:
        sess.run(init_op)
        i = 0
        while i < total_episodes:
            # Choose either a random action or one from the network.
            if np.random.rand(1) < chance_of_random_action:
                action = np.random.randint(num_bandits)
            else:
                action = sess.run(action_op)
            # Get the reward for pulling the chosen bandit's arm.
            reward = getBandit(bandits[action])
            # Update the network and fetch the responsible weight and the full weight vector.
            _,resp,ww = sess.run([training_op,responsible_weight,weight_op], feed_dict={reward_holder:[reward],action_holder:[action]})
            # Update the running tally of scores.
            total_reward[action] += reward
            if i % 50 == 0:
                print("Running reward for all the " + str(num_bandits) + " bandits: " + str(total_reward))
            i+=1

    Now let's evaluate the above model as follows:

    print("The agent thinks bandit " + str(np.argmax(ww)+1) + " would be the most efficient one.")
    if np.argmax(ww) == np.argmax(-np.array(bandits)):
        print(" and it was right at the end!")
    else:
        print(" and it was wrong at the end!")
    >>>

    The first run generates the following output:

    Running reward for all the 4 bandits: [-1. 0. 0. 0.]
    Running reward for all the 4 bandits: [ -1. -2. 14. 0.]
    …
    Running reward for all the 4 bandits: [ -15. -7. 340. 21.]
    Running reward for all the 4 bandits: [ -15. -10. 364. 22.]
    The agent thinks Bandit 3 would be the most efficient one and it was wrong at the end!

    A second run generates a different result, as follows:

    Running reward for all the 4 bandits: [ 1. 0. 0. 0.]
    Running reward for all the 4 bandits: [ -1. 11. -3. 0.]
    Running reward for all the 4 bandits: [ -2. 1. -2. 20.]
    …
    Running reward for all the 4 bandits: [ -7. -2. 8. 762.]
    Running reward for all the 4 bandits: [ -8. -3. 8. 806.]
    The agent thinks Bandit 4 would be the most efficient one and it was right at the end!

    Now you can see the limitation of this agent: being stateless, it simply learns which single bandit to prefer. There are no environmental states, so the agent only has to learn which one action is best to take. To overcome this limitation, we can move to contextual bandits.

    With contextual bandits, we introduce the state and make proper use of it. The state is a description of the environment that the agent can use to take more intelligent, informed actions. Instead of using a single bandit, we chain several bandits together. So what is the function of the state? The state of the environment tells the agent which bandit it currently faces, and the goal of the agent is to learn the best action for any of the bandits.

    This makes the problem harder, because each bandit may have different reward probabilities for each arm, and the agent needs to learn to condition its action on the state of the environment. Otherwise, it cannot achieve the maximum possible reward:


    Figure 5: Stateless versus contextual bandits

    As mentioned earlier, to address this issue we can build a single-layer neural network that takes a state and yields an action. As with the stateless bandits, we can use a policy gradient update method so that the network learns to take actions that maximize the reward. This simplified way of posing an RL problem is referred to as the contextual bandit. A small conceptual sketch follows.
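
    Conceptually, such a single-layer network is nothing more than a weight matrix with one row of action values per state. The following NumPy-only sketch (with hypothetical values) illustrates the idea before we build the real TensorFlow agent:

    import numpy as np

    W = np.ones((4, 4))           # one row of action values per bandit/state
    state = 1                     # state supplied by the environment
    action = np.argmax(W[state])  # pick the arm with the highest value for this state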

  5. Developing contextual bandits.

    This example was adapted and extended from "Simple Reinforcement Learning with TensorFlow Part 1.5: Contextual Bandits" by Arthur Juliani, published at https://medium.com/.

    At first, let's define our contextual bandits. For this example, we will use four four-armed bandits; that is, each bandit has four arms that can be pulled. Since the bandits are contextual and tied to a state, each bandit's arms have different success probabilities, and different actions are therefore required to obtain the best predictive result.

    Here, we define a class named contextualBandit(), consisting of a constructor and two user-defined methods: getBandit() and pullArm(). The getBandit() method returns a random state (that is, a random bandit) for each episode, while pullArm() generates a random number from a normal distribution with a mean of 0 and compares it with the chosen arm's value; the lower the value, the more likely a positive reward will be returned. We want our agent to learn to choose the bandit arm that most often gives a positive reward, which depends on the bandit presented. The constructor lists out all of our bandits; for Bandits 1 to 4, arms 4, 2, 3, and 1 are the most optimal, respectively.

    Also, if you look carefully, most reinforcement learning algorithms follow similar implementation patterns. Thus, it's a good idea to create a class with the relevant methods that you can reference later, much like an abstract class or interface:

    class contextualBandit():
        def __init__(self):
            self.state = 0
            # List out all of our bandits; arms 4, 2, 3, and 1 (respectively) are the most optimal.
            self.bandits = np.array([[0.2,0,-0.0,-5], [0.1,-5,1,0.25], [0.3,0.4,-5,0.5], [-5,5,5,5]])
            self.num_bandits = self.bandits.shape[0]
            self.num_actions = self.bandits.shape[1]

        def getBandit(self):
            '''
            This function returns a random state (bandit) for each episode.
            '''
            self.state = np.random.randint(0, len(self.bandits))
            return self.state

        def pullArm(self, action):
            '''
            This function creates the reward for the chosen arm of the current bandit on the basis of a randomly generated number. It returns either a positive or a negative reward.
            '''
            bandit = self.bandits[self.state, action]
            result = np.random.randn(1)
            if result > bandit:
                return 1
            else:
                return -1
  6. Developing a policy-based agent.

    The following class, ContextualAgent, implements our simple but very effective neural, contextual agent. We supply the current state as input, and it returns an action conditioned on the state of the environment. This is the most important step toward turning a stateless agent into a stateful one that can work toward solving a full RL problem.

    Here, the agent uses a single set of weights for choosing a particular arm given a bandit. The policy gradient method is used to update the agent by moving the value for a particular action toward the received reward (a quick sanity check follows the class definition):

    class ContextualAgent():
        def __init__(self, lr, s_size,a_size):
            '''
            This function establishes the feed-forward part of the network. The agent takes a state and produces an action; that is, it is a contextual agent.
            ''' 
            self.state_in= tf.placeholder(shape=[1], dtype=tf.int32)
            state_in_OH = slim.one_hot_encoding(self.state_in, s_size)
            output = slim.fully_connected(state_in_OH, a_size,biases_initializer=None, activation_fn=tf.nn.sigmoid, weights_initializer=tf.ones_initializer())
            self.output = tf.reshape(output,[-1])
            self.chosen_action = tf.argmax(self.output,0)
            self.reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
            self.action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
            self.responsible_weight = tf.slice(self.output, self.action_holder,[1])
            self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)
            optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
            self.update = optimizer.minimize(self.loss)
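
    As a quick, optional sanity check (with hypothetical values, and assuming the imports and the class definition above), you can instantiate the agent and ask it for an action given a state:

    tf.reset_default_graph()
    test_agent = ContextualAgent(lr=0.001, s_size=4, a_size=4)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # With all weights initialized to 1, the output is identical for every arm,
        # so argmax returns the first arm (index 0) for any state.
        print(sess.run(test_agent.chosen_action, feed_dict={test_agent.state_in: [0]}))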
  7. Training the contextual bandit agent.

    At first, we clear the default TensorFlow graph:

    tf.reset_default_graph()

    Then, we define some parameters that will be used to train the agent:

    learning_rate = 0.001 # Learning rate.
    chance_of_random_action = 0.1 # Chance of a random action.
    max_iteration = 10000 # Max iterations to train the agent.

    Now, before starting the training, we need to load the bandits and then our agent:

    contextualBandit = contextualBandit() #Load the bandits.
    contextualAgent = ContextualAgent(lr=learning_rate, s_size=contextualBandit.num_bandits, a_size=contextualBandit.num_actions) #Load the agent.

    Now, we grab the trainable weights of the network; these are what we will inspect later to evaluate what the agent has learned. We also set the scoreboard for the bandits to 0 initially:

    weights = tf.trainable_variables()[0] 
    total_reward = np.zeros([contextualBandit.num_bandits,contextualBandit.num_actions])

    Then, we initialize all the variables using the global_variables_initializer() function:

    init_op = tf.global_variables_initializer()

    Finally, we start the training. It is similar to the training we did in the preceding example. However, here we also keep a running tally of rewards per bandit and print their mean, so that we can evaluate the agent's prediction accuracy later on:

    with tf.Session() as sess:
        sess.run(init_op)
        i = 0
        while i < max_iteration:
            s = contextualBandit.getBandit() #Get a state from the environment.
            #Choose a random action or one from our network.
            if np.random.rand(1) < chance_of_random_action:
                action = np.random.randint(contextualBandit.num_actions)
            else:
                action = sess.run(contextualAgent.chosen_action,feed_dict={contextualAgent.state_in:[s]})
            reward = contextualBandit.pullArm(action) #Get our reward for taking an action given a bandit.
            #Update the network.
            feed_dict={contextualAgent.reward_holder:[reward],contextualAgent.action_holder:[action],contextualAgent.state_in:[s]}
            _,ww = sess.run([contextualAgent.update,weights], feed_dict=feed_dict)        
            #Update our running tally of scores.
            total_reward[s,action] += reward
            if i % 500 == 0:
                print("Mean reward for each of the " + str(contextualBandit.num_bandits) + " bandits: " + str(np.mean(total_reward,axis=1)))
            i+=1
    >>>
    Mean reward for each of the 4 bandits: [ 0. 0. -0.25 0. ]
    Mean reward for each of the 4 bandits: [ 25.75 28.25 25.5 28.75]
    …
    Mean reward for each of the 4 bandits: [ 488.25 489. 473.5 440.5 ]
    Mean reward for each of the 4 bandits: [ 518.75 520. 499.25 465.25]
    Mean reward for each of the 4 bandits: [ 546.5 547.75 525.25 490.75]
  8. Evaluating the agent.

    Now that we have the mean reward for all four bandits, it's time to use the learned weights to predict something interesting, that is, which of each bandit's arms will maximize the reward. First, we initialize a couple of variables so that we can also estimate the prediction accuracy:

    right_flag = 0
    wrong_flag = 0

    Then let's start evaluating the agent's prediction performance:

    for a in range(contextualBandit.num_bandits):
        print("The agent thinks action " + str(np.argmax(ww[a])+1) + " for bandit " + str(a+1) + " would be the most efficient one.")
        if np.argmax(ww[a]) == np.argmin(contextualBandit.bandits[a]):
            right_flag += 1
            print(" and it was right at the end!")
        else:
            print(" and it was wrong at the end!")
            wrong_flag += 1
    >>>
    The agent thinks action 4 for Bandit 1 would be the most efficient one and it was right at the end!
    The agent thinks action 2 for Bandit 2 would be the most efficient one and it was right at the end!
    The agent thinks action 3 for Bandit 3 would be the most efficient one and it was right at the end!
    The agent thinks action 1 for Bandit 4 would be the most efficient one and it was right at the end!

    As you can see, all four predictions are correct. Now we can compute the prediction accuracy as follows:

    prediction_accuracy = (right_flag/(right_flag + wrong_flag))
    print("Prediction accuracy (%):", prediction_accuracy * 100)
    >>>
    Prediction accuracy (%): 100.0

Fantastic, well done! We have managed to design and develop a more robust bandit agent by means of a contextual agent that can accurately predict which arm, that is, which action, of a bandit will achieve the maximum reward, that is, profit.

In the next section, we will look at another interesting and very useful application, stock price prediction, where we will see how to develop a policy-based Q-learning agent using RL.
