The multi-arm bandit problem

Let me start with an analogy to understand this topic better. Do you like pizza? I like it a lot! My favorite restaurant in Bangalore serves delicious pizzas. I go to this place almost every time I feel like eating a pizza, and I am almost sure that I will get the best pizza. However, going to the same restaurant every time makes me worry that I am missing out on even better pizzas served elsewhere in town!

One alternative is to try out restaurants one by one and sample the pizzas there, but this means that there is a very high probability that I will end up eating pizzas that aren't very nice. However, this is the only way for me to find a restaurant that serves better pizzas than the one I am currently aware of. I am aware that you must be wondering why I am talking about pizzas when I am supposed to be talking about RL. Let me get to the point.

The dilemma in this task arises from incomplete information. In other words, to solve it, we need to gather enough information to formulate the best overall strategy while still exploring new actions. This will eventually minimize the number of bad experiences overall. This situation is otherwise known as the exploration versus exploitation dilemma:

Exploration versus exploitation dilemma

The preceding diagram aptly summarizes my best-pizza problem.

The multi-arm bandit problem (MABP) is a simplified form of the pizza analogy. It is used to represent similar kinds of problems, and finding a good strategy for solving them is already helping many industries.

A bandit is defined as someone who steals your money! A one-armed bandit is a simple slot machine. We find this sort of machine in a casino: you insert a coin into the slot machine, pull a lever, and pray to the luck god to get an immediate reward. But the million-dollar question is: why is a slot machine called a bandit? It turns out that all casinos configure their slot machines in such a way that all gamblers end up losing money!

A multi-arm bandit is a hypothetical, more complicated slot machine: imagine more than one slot machine lined up in a row. A gambler can pull any of several levers, with each lever giving a different return. The following diagram depicts the probability distribution of the corresponding reward, which is different for each lever and unknown to the gambler:

Multi-arm bandit

Given these slot machines, and after a set of initial trials, the task is to identify which lever to pull to get the maximum reward. In other words, pulling any one of the arms gives us a stochastic reward of either R=+1 for success or R=0 for failure; this is called an immediate reward. A multi-arm bandit that issues a reward of 1 or 0 is called a Bernoulli bandit. The objective is to pull the arms one by one in a sequence, while gathering information, to maximize the total payout over the long run. Formally, a Bernoulli MABP can be described as a tuple of (A, R), where the following applies (a short code sketch follows the list):

  • We have K machines with reward probabilities, {θ_1, …, θ_K}.
  • At each time step, t, we take an action, a, on one slot machine and receive a reward, r.
  • A is a set of actions, each referring to the interaction with one slot machine. The value of action a is the expected reward, Q(a) = E[r | a] = θ. If action a at time step t is on the i-th machine, then Q(a_t) = θ_i. Q(a) is generally referred to as the action-value function.
  • R is a reward function. In the case of the Bernoulli bandit, we observe a reward, r, in a stochastic fashion. At time step t, r(a_t) may return reward 1 with a probability of Q(a_t), or 0 otherwise.
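
To make the preceding definition concrete, here is a minimal Python sketch of a Bernoulli bandit. The class name BernoulliBandit and the reward probabilities in the example are purely illustrative assumptions, not part of the original problem statement:

import random

class BernoulliBandit:
    """A minimal sketch of a Bernoulli multi-arm bandit.

    Each arm i pays a reward of 1 with hidden probability theta_i
    and 0 otherwise; the probabilities are unknown to the gambler.
    """

    def __init__(self, thetas):
        self.thetas = thetas      # true reward probabilities {theta_1, ..., theta_K}
        self.k = len(thetas)      # number of arms, K

    def pull(self, arm):
        """Pull one arm and return an immediate reward of 1 or 0."""
        return 1 if random.random() < self.thetas[arm] else 0


# A three-armed Bernoulli bandit with made-up reward probabilities.
bandit = BernoulliBandit([0.3, 0.55, 0.7])
print(bandit.pull(1))             # 1 with probability 0.55, otherwise 0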

We can solve the MABP with multiple strategies, and we will review some of them shortly in this section. To decide on the best strategy, and to compare the different strategies, we need a quantitative method. One method is to directly compute the cumulative reward after a certain predefined number of trials. Comparing the cumulative rewards from each of the strategies gives us an opportunity to identify the best strategy for the problem.
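
As a rough illustration of such a comparison, the following sketch implements one common strategy, epsilon-greedy, against the hypothetical BernoulliBandit class from the previous sketch and reports its cumulative reward after a fixed number of trials. The function name, the value of epsilon, and the number of trials are assumptions chosen for the example:

import random

def epsilon_greedy(bandit, n_trials=1000, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit the
    arm with the highest estimated value Q(a). Returns the cumulative reward."""
    counts = [0] * bandit.k               # how many times each arm was pulled
    values = [0.0] * bandit.k             # running estimate of Q(a) for each arm
    total_reward = 0

    for _ in range(n_trials):
        if random.random() < epsilon:
            arm = random.randrange(bandit.k)                         # explore
        else:
            arm = max(range(bandit.k), key=lambda a: values[a])      # exploit
        r = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]               # incremental mean
        total_reward += r

    return total_reward


# Compare strategies by running each one for the same number of trials
# and looking at the cumulative reward it collects.
print(epsilon_greedy(bandit, n_trials=1000, epsilon=0.1))

Running the same loop with a different strategy (or different parameter values) and comparing the returned cumulative rewards is the quantitative comparison described above.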

At times, we may already know the best action for the given bandit problem. In those cases, it may be interesting to look at the concept of regret.

Let's imagine that we know the details of the best arm to pull for the given bandit problem. Assume that by repeatedly pulling this best arm, we get the maximum expected reward, which is shown as a horizontal line in the following diagram:

Maximum reward obtained by pulling the best arm for a MABP

As per the problem statement, we need to make repeated trials by pulling different arms of the multi-arm bandit until we are approximately sure which arm to pull for the maximum average return at time t. Exploring and deciding upon the best arm takes a number of rounds, otherwise called trials, and these trials incur some loss; this loss is called regret. In other words, we want to maximize the reward even during the learning phase. Regret can be summarized as a quantification of exactly how much we regret not picking the optimal arm.
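
As a sketch under the same assumptions as before (the hypothetical BernoulliBandit and epsilon_greedy examples), total regret after a fixed number of trials can be computed as the expected reward of always pulling the best arm minus the reward the strategy actually collected:

def total_regret(bandit, total_reward, n_trials):
    """Regret after n_trials: the expected reward of always pulling the best
    arm minus the reward the strategy actually collected."""
    best_theta = max(bandit.thetas)   # theta* of the optimal arm (known to us here, not to the gambler)
    return n_trials * best_theta - total_reward


# Example, reusing the earlier sketches.
reward = epsilon_greedy(bandit, n_trials=1000, epsilon=0.1)
print(total_regret(bandit, reward, n_trials=1000))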

The following diagram is an illustration showing the regret due to trials done to find the best arm:

Concept of regret in MAB