Question
This question is an attempt to reframe an earlier question to make it clearer.
The first slide shows an equation for Q(state, action) in terms of a single set of weights and feature functions.
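Concretely, I take that equation to be a single shared weight vector applied to state-action features (my notation, not necessarily the slide's):

Q(s, a) \approx \sum_{i=1}^{n} w_i \, f_i(s, a)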
These discussions (The Basic Update Rule and Linear Value Function Approximation) show a set of weights for each action.
The reason they are different is that the first slide assumes you can anticipate the result of performing an action and then find features for the resulting states. (Note that the feature functions are functions of both the current state and the anticipated action.) In that case, the same set of weights can be applied to all the resulting features.
But in some cases, one can't anticipate the effect of an action. Then what does one do? Even with perfect weights, one can't apply them to the features of the resulting states if one can't anticipate those states.
My guess is that the second pair of slides deals with that problem. Instead of performing an action and then applying weights to the features of the resulting states, compute features of the current state and apply a (possibly different) set of weights for each action.
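To make the contrast concrete, here is a minimal sketch of the two parameterizations as I understand them; the feature functions `features_sa` and `features_s` are hypothetical placeholders, not names from the course material:

```python
import numpy as np

def q_shared(weights, features_sa, state, action):
    """Approach 1: Q(s, a) = w . f(s, a), one shared weight vector.
    The features depend on both the state and the (anticipated) action."""
    return np.dot(weights, features_sa(state, action))

def q_per_action(weights_by_action, features_s, state, action):
    """Approach 2: Q(s, a) = w_a . f(s), a separate weight vector per action.
    The features depend only on the current state."""
    return np.dot(weights_by_action[action], features_s(state))
```

In both cases action selection is still an argmax over Q(s, a) for the available actions; the difference is only in how the approximator is parameterized.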
Those are two very different ways of doing feature-based approximation. Are they both valid? The first one makes sense in situations like Taxi, in which one can effectively simulate what the environment will do for each action. But in other cases, e.g., cart-pole, that's not feasible. Then it would seem you need a separate set of weights for each action.
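If it helps, this is how I picture the update for the per-action-weights variant (a sketch of a standard linear Q-learning step, not something taken from the slides); `features_s` is the same hypothetical feature function as above:

```python
import numpy as np

def linear_q_update(weights_by_action, features_s, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.99):
    """One Q-learning step with a separate weight vector per action.
    Only the weight vector of the action actually taken is updated."""
    phi = features_s(s)
    q_next = max(np.dot(weights_by_action[a2], features_s(s_next)) for a2 in actions)
    td_error = (r + gamma * q_next) - np.dot(weights_by_action[a], phi)
    weights_by_action[a] = weights_by_action[a] + alpha * td_error * phi
    return weights_by_action
```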
Is this the right way to think about it, or am I missing something?
Thanks.
Source: https://stackoverflow.com/questions/53398440/in-reinforcement-learning-using-feature-approximation-does-one-have-a-single-se