Utility

The long-term reward is called a utility. It turns out that if we know the utility of performing an action in a state, then solving RL becomes easy. For example, to decide which action to take, we simply select the action that produces the highest utility. However, uncovering these utility values is difficult. The utility of performing an action a in a state s is written as a function, Q(s, a), called the utility function. It predicts the expected immediate reward, plus the rewards that follow from an optimal policy, given the state-action input, as shown in Figure 4:

Figure 4: Using a utility function
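
To make the selection rule concrete, here is a minimal sketch of greedy action selection over a small dictionary-backed Q-table; the states, actions, and utility values are made-up examples rather than part of any particular environment:

```python
# A minimal sketch of greedy action selection from a utility function.
# The Q-table values and action names here are made up for illustration.
Q = {
    ("s0", "left"): 1.0,
    ("s0", "right"): 2.5,   # the highest utility available in state "s0"
}

def best_action(state, actions, Q):
    """Pick the action a with the highest utility Q(s, a) in the given state."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(best_action("s0", ["left", "right"], Q))  # -> right
```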

Most RL algorithms boil down to just three main steps: infer, do, and learn. During the first step, the algorithm selects the best action (a) given a state (s) using the knowledge it has so far. Next, it performs the action to find out the reward (r) as well as the next state (s'). Then it improves its understanding of the world using the newly acquired knowledge (s, a, r, s'). However, as I think you will agree, this is just a naive way to calculate the utility.
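
These three steps map directly onto a training loop. The skeleton below is only a sketch: it assumes a hypothetical environment object with `reset()` and `step(action)` methods, and takes the `infer` and `learn` routines as plug-in functions.

```python
# Skeleton of the infer-do-learn loop. The environment interface (reset/step)
# and the infer/learn callables are assumptions standing in for your own code.
def run_episode(env, infer, learn, actions):
    s = env.reset()                       # start from an initial state
    done = False
    while not done:
        a = infer(s, actions)             # infer: pick the best-known action for s
        s_next, r, done = env.step(a)     # do: act, observe reward r and next state s'
        learn(s, a, r, s_next)            # learn: fold (s, a, r, s') back into the model
        s = s_next
```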

Now, the question is: what could be a more robust way to compute it? We can calculate the utility of a particular state-action pair (s, a) by recursively considering the utilities of future actions. The utility of your current action is influenced not only by the immediate reward but also by the next best action, as shown in the following formula:
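
Q(s, a) = r(s, a) + γ max_a' Q(s', a')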

s' denotes the next state, and a' denotes the next action. The reward of taking action a in state s is denoted by r(s, a). Here, γ is a hyperparameter that you get to choose, called the discount factor. If γ is 0, then the agent chooses the action that maximizes the immediate reward. Higher values of γ make the agent put more weight on long-term consequences. In practice, there are more such hyperparameters to consider. For example, if a vacuum cleaner robot is expected to learn to solve tasks quickly, but not necessarily optimally, we may want to set a faster learning rate.

Alternatively, if a robot is allowed more time to explore and exploit, we may tune down the learning rate. Let us call the learning rate α and change our utility function as follows (note that when α = 1, the two equations are identical):
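
Q(s, a) ← (1 − α) Q(s, a) + α (r(s, a) + γ max_a' Q(s', a'))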

In summary, an RL problem can be solved if we know this Q(s, a) function. This is where an algorithm called Q-learning comes in.
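
Before moving on, here is a minimal tabular sketch of the update rule above in code. The defaultdict-backed Q-table, the hyperparameter values, and the epsilon-greedy exploration scheme are illustrative assumptions rather than a definitive implementation; the two functions could serve as the infer and learn steps in the loop sketched earlier.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch. The hyperparameter values and the
# epsilon-greedy exploration scheme are illustrative choices.
Q = defaultdict(float)                 # Q[(state, action)] -> estimated utility
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount factor, exploration rate

def choose_action(state, actions):
    """Epsilon-greedy: mostly exploit the best-known action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    """Blend the old estimate with the new target, weighted by the learning rate alpha."""
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
```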
