TD prediction

Both TD and MC methods use experience to solve a prediction problem. Given some policy π, both update their estimate V of vπ for the non-terminal states St occurring in that experience. Monte Carlo methods wait until the return following the visit is known, then use that return as a target for V(St).

The preceding method can be called constant-α MC, with the update V(St) ← V(St) + α[Gt - V(St)]. MC must wait until the end of the episode to determine the increment to V(St) (only then is Gt known).
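As a minimal sketch of the constant-α MC update applied at the end of an episode, the helper below works backwards through a list of (state, reward) pairs; the function name, the `episode` structure, and the default `alpha` and `gamma` are assumptions for this illustration:

```python
def constant_alpha_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit constant-alpha MC: move V(S_t) toward the observed return G_t.

    episode: list of (state, reward) pairs, where reward is R_{t+1}
             received after leaving that state.
    """
    G = 0.0
    # Work backwards so the return G_t can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G              # G_t = R_{t+1} + gamma * G_{t+1}
        V[state] += alpha * (G - V[state])  # constant-alpha MC update
    return V
```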

TD methods need to wait only until the next time step. At time t+1, they immediately form a target and make a useful update using the observed reward Rt+1 and the estimate V(St+1). The simplest TD method, known as TD(0), is:

V(St) ← V(St) + α[Rt+1 + γV(St+1) - V(St)]

The target for the MC update is Gt, whereas the target for the TD update is Rt+1 + γV(St+1).
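To make the contrast concrete, the snippet below computes both targets for a single time step; the state names, the value estimates, and the numbers for `reward` and `G_t` are hypothetical and chosen only for this example:

```python
gamma, alpha = 0.9, 0.1
V = {"s": 0.5, "s_next": 0.8}   # hypothetical value estimates
reward, G_t = 1.0, 2.3          # hypothetical observed reward and full return

mc_target = G_t                            # known only once the episode ends
td_target = reward + gamma * V["s_next"]   # available already at time t+1

# Both updates have the same form; only the target differs.
V["s"] += alpha * (td_target - V["s"])     # TD(0) update
# V["s"] += alpha * (mc_target - V["s"])   # constant-alpha MC update
```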

In the following diagram, TD and MC methods are compared. As the TD(0) equation shows, we use one step of real data and then the estimated value of the next state. In the same way, we could use two steps of real data, which gives a better picture of reality, and then the estimated value of the third state. However, the more steps we take, the more data each parameter update requires, and the longer each update takes.

When the number of steps is extended all the way to the terminal state, so that parameters are updated only at the end of each episode, TD becomes the Monte Carlo method.
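As a sketch of this spectrum, the helper below builds an n-step target from a list of observed rewards plus the estimated value of the state reached after n steps; the function and argument names are assumptions for this illustration:

```python
def n_step_target(rewards, bootstrap_value, gamma=1.0):
    """n-step TD target: n real rewards plus the discounted value estimate
    of the state reached after n steps.

    rewards         : [R_{t+1}, ..., R_{t+n}] observed from the environment
    bootstrap_value : current estimate V(S_{t+n}); use 0.0 if S_{t+n} is terminal
    """
    target = bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target

# With n = 1 this reproduces the TD(0) target; letting n run to the end of the
# episode (bootstrap_value = 0.0 at the terminal state) reproduces the MC target G_t.
```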

The TD(0) algorithm for estimating vπ consists of the following steps (see the Python sketch after this list):

  1. Input: the policy π to be evaluated; step size α ∈ (0, 1].
  2. Initialize V(s) arbitrarily for all states (for example, V(s) = 0), with V(terminal) = 0.
  3. Repeat (for each episode):
    • Initialize S
    • Repeat (for each step of the episode), until S is terminal:
      • A ← action given by π for S
      • Take action A, observe R, S'
      • V(S) ← V(S) + α[R + γV(S') - V(S)]
      • S ← S'
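Below is a minimal, self-contained Python sketch of tabular TD(0) prediction following the steps above. The environment interface (`env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are assumptions for this example, not a specific library API:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: estimate V ~ v_pi for a fixed policy.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # V(s) initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()              # Initialize S
        done = False
        while not done:                  # Repeat for each step of the episode
            action = policy(state)                       # A <- action given by pi for S
            next_state, reward, done = env.step(action)  # Take A, observe R, S'
            # Bootstrap from V(S'), treating the terminal state's value as 0.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])      # TD(0) update
            state = next_state                           # S <- S'
    return V
```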