TD prediction

Both TD and MC methods use experience to solve a prediction problem. Given some policy π, both update their estimate V of vπ for the non-terminal states St occurring in that experience. Monte Carlo methods wait until the return following the visit is known, then use that return as a target for V(St).

The preceding method can be called constant-α MC, with the update V(St) ← V(St) + α[Gt - V(St)]. MC must wait until the end of the episode to determine the increment to V(St) (only then is Gt known).
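As a minimal sketch of the constant-α MC update applied at the end of an episode, the helper below works backwards through a list of (state, reward) pairs; the function name, the `episode` structure, and the default `alpha` and `gamma` are assumptions for this illustration:

```python
def constant_alpha_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit constant-alpha MC: move V(S_t) toward the observed return G_t.

    episode: list of (state, reward) pairs, where reward is R_{t+1}
             received after leaving that state.
    """
    G = 0.0
    # Work backwards so the return G_t can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G              # G_t = R_{t+1} + gamma * G_{t+1}
        V[state] += alpha * (G - V[state])  # constant-alpha MC update
    return V
```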

TD methods need to wait only until the next time step. At time t+1, they immediately form a target and make a useful update using the observed reward Rt+1 and the estimate V(St+1). The simplest TD method, known as TD(0), is:

V(St) ← V(St) + α[Rt+1 + γV(St+1) - V(St)]

The target for the MC update is Gt, whereas the target for the TD update is Rt+1 + γV(St+1).
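To make the contrast concrete, the snippet below computes both targets for a single time step; the state names, the value estimates, and the numbers for `reward` and `G_t` are hypothetical and chosen only for this example:

```python
gamma, alpha = 0.9, 0.1
V = {"s": 0.5, "s_next": 0.8}   # hypothetical value estimates
reward, G_t = 1.0, 2.3          # hypothetical observed reward and full return

mc_target = G_t                            # known only once the episode ends
td_target = reward + gamma * V["s_next"]   # available already at time t+1

# Both updates have the same form; only the target differs.
V["s"] += alpha * (td_target - V["s"])     # TD(0) update
# V["s"] += alpha * (mc_target - V["s"])   # constant-alpha MC update
```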

In the following diagram, TD and MC methods are compared. As the TD(0) equation shows, we use one step of real data and then the estimated value of the next state. In the same way, we could use two steps of real data, which gives a better picture of reality, and then the estimated value of the third state. However, the more steps we take, the more data each parameter update requires, and the longer each update takes.

When the number of steps is extended all the way to the terminal state, so that parameters are updated only at the end of each episode, TD becomes the Monte Carlo method.
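As a sketch of this spectrum, the helper below builds an n-step target from a list of observed rewards plus the estimated value of the state reached after n steps; the function and argument names are assumptions for this illustration:

```python
def n_step_target(rewards, bootstrap_value, gamma=1.0):
    """n-step TD target: n real rewards plus the discounted value estimate
    of the state reached after n steps.

    rewards         : [R_{t+1}, ..., R_{t+n}] observed from the environment
    bootstrap_value : current estimate V(S_{t+n}); use 0.0 if S_{t+n} is terminal
    """
    target = bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target

# With n = 1 this reproduces the TD(0) target; letting n run to the end of the
# episode (bootstrap_value = 0.0 at the terminal state) reproduces the MC target G_t.
```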

The TD(0) algorithm for estimating vπ consists of the following steps (see the Python sketch after this list):

  1. Input: the policy π to be evaluated; step size α ∈ (0, 1].
  2. Initialize V(s) arbitrarily for all states (for example, V(s) = 0), with V(terminal) = 0.
  3. Repeat (for each episode):
    • Initialize S
    • Repeat (for each step of the episode), until S is terminal:
      • A ← action given by π for S
      • Take action A, observe R, S'
      • V(S) ← V(S) + α[R + γV(S') - V(S)]
      • S ← S'
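Below is a minimal, self-contained Python sketch of tabular TD(0) prediction following the steps above. The environment interface (`env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are assumptions for this example, not a specific library API:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: estimate V ~ v_pi for a fixed policy.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # V(s) initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()              # Initialize S
        done = False
        while not done:                  # Repeat for each step of the episode
            action = policy(state)                       # A <- action given by pi for S
            next_state, reward, done = env.step(action)  # Take A, observe R, S'
            # Bootstrap from V(S'), treating the terminal state's value as 0.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])      # TD(0) update
            state = next_state                           # S <- S'
    return V
```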