Question
I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.
- Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?
- What exactly is getting backpropagated when TD is used with a neural net? That is, where does the error that gets backpropagated come from when using TD?
Answer 1:
The backward and forward views can be confusing, but when you are dealing with something simple like a game-playing program, things are actually pretty straightforward in practice. I'm not looking at the reference you're using, so let me just provide a general overview.
Suppose I have a function approximator like a neural network, and that it has two functions, train and predict, for training on a particular output and predicting the outcome of a state. (Or the outcome of taking an action in a given state.)
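For concreteness, here is a minimal sketch of that interface in Python. Nothing below comes from the original answer: the linear model, the NumPy feature vectors, and the learning rate are stand-ins for whatever network you actually use, but the train/predict roles are the same. Note that the error computed inside train (target minus current prediction) is exactly the quantity a neural net would backpropagate.

import numpy as np

class Approximator:
    # Minimal linear stand-in for a neural-network value function.
    # predict(state) returns the current value estimate for a state
    # (given as a NumPy feature vector); train(state, target) nudges
    # that estimate toward the target with one gradient step.
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, state):
        return float(self.w @ state)

    def train(self, state, target):
        # With a real network, this error is what gets backpropagated.
        error = target - self.predict(state)
        self.w += self.lr * error * state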
Suppose I have a trace of play from playing a game, where I used the predict method to tell me what move to make at each point, and suppose that I lose at the end of the game (V = 0). Suppose my states are s_1, s_2, s_3, ..., s_n.
The Monte Carlo approach says that I train my function approximator (e.g., my neural network) on each of the states in the trace, using the final score as the target. So, given this trace, you would do something like:
train(s_n, 0)
train(s_{n-1}, 0)
...
train(s_1, 0)
That is, I'm asking every state to predict the final outcome of the trace.
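As a sketch (assuming the hypothetical Approximator above, with the trace given as a list of state feature vectors), the Monte Carlo pass is just:

def monte_carlo_update(approx, trace, final_outcome):
    # Train every state in the trace toward the actual final outcome
    # of the game (0 for a loss in this example).
    for state in reversed(trace):
        approx.train(state, final_outcome)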
The dynamic programming approach says that I train each state based on the prediction for the next state. So my training would be something like:
train(s_n, 0)
train(s_{n-1}, predict(s_n))
...
train(s_1, predict(s_2))
That is, I'm asking the function approximator to predict what the next state predicts, where the last state predicts the final outcome from the trace.
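A matching sketch for the dynamic-programming (one-step) targets, again assuming the hypothetical Approximator and trace format from the earlier sketches:

def one_step_update(approx, trace, final_outcome):
    # Walk the trace backwards: the last state is trained toward the
    # final outcome, every earlier state toward the prediction of the
    # state that follows it.
    target = final_outcome
    for state in reversed(trace):
        approx.train(state, target)
        target = approx.predict(state)  # becomes the previous state's target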
TD learning mixes between the two of these, where λ = 1 corresponds to the first case (Monte Carlo) and λ = 0 corresponds to the second case (dynamic programming). Suppose that we use λ = 0.5. Then our training would be:
train(s_n, 0)
train(s_{n-1}, 0.5*0 + 0.5*predict(s_n))
train(s_{n-2}, 0.25*0 + 0.25*predict(s_n) + 0.5*predict(s_{n-1}))
...
Now, what I've written here isn't completely correct, because in practice you don't re-run all of those predictions at each step. Instead, you start with a target value (V = 0 in our example) and, after training each state s_i, you update that target for the next step using the state's own prediction: V = λ·V + (1-λ)·predict(s_i).
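That running update is easy to express as code. This sketch reuses the hypothetical Approximator from above; lam = 1 reproduces the Monte Carlo targets and lam = 0 the one-step targets, as described:

def td_lambda_update(approx, trace, final_outcome, lam=0.5):
    # Keep a running target V, train each state on it while walking
    # the trace backwards, then blend in that state's own prediction:
    # V = lam * V + (1 - lam) * predict(s_i).
    V = final_outcome
    for state in reversed(trace):
        approx.train(state, V)
        V = lam * V + (1 - lam) * approx.predict(state)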
This learns much faster than the Monte Carlo and dynamic programming approaches, because you aren't asking the algorithm to learn such extreme values (either ignoring the current prediction or ignoring the final outcome).
Source: https://stackoverflow.com/questions/23235181/neural-network-and-temporal-difference-learning