Question
This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions. I'm confused about how to write the feature functions.
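For reference, the equation is of the form

$$Q(s, a) = \sum_{i} w_i \, f_i(s, a)$$

i.e. a weighted sum of feature values.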
Given an observation, I can understand how to extract features from the observation. But given an observation, one doesn't know what the result of taking an action will be on the features. So how does one write a function that maps an observation and an action to a numerical value?
In the Pacman example shown a few slides later, one knows, given a state, what the effect of an action will be. But that's not always the case. For example, consider the cart-pole problem (in OpenAI gym). The features (which are, in fact, what the observation consists of) are four values: cart position, cart velocity, pole angle, and pole rotational velocity. There are two actions: push left, and push right. But one doesn't know in advance how those actions will change the four feature values. So how does one compute Q(s, a)? That is, how does one write the feature functions f_i(state, action)?
Thanks.
Answer 1:
How you select actions depends on your algorithm and your exploration strategy. For example, in Q-learning you can use something called epsilon-greedy exploration: with probability epsilon you select an action at random, and the rest of the time you take the action with the highest expected value (the greedy action).
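A minimal sketch of epsilon-greedy selection in Python (the Q values and the epsilon of 0.1 below are just placeholders):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_values: 1-D array of estimated Q(s, a) for each action in the current state.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: two actions (push left / push right), as in cart-pole.
action = epsilon_greedy(np.array([0.2, 0.5]), epsilon=0.1)
```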
"So how does one write a function that maps an observation and an action to a numerical value?"
By using rewards you can approximate state-action values: you combine the observed reward with (depending on the algorithm) the estimated value of the next state. For example, the Q-learning update formula:
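In standard notation, with learning rate $\alpha$ and discount factor $\gamma$, after taking action $a$ in state $s$ and observing reward $r$ and next state $s'$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$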
You update the old Q(s,a) value with the reward and your estimate of the optimal future value from the next state.
In tabular Q-learning you estimate each Q(s,a) value individually and update it every time you visit a state and take an action. In function-approximation Q-learning you use something like a neural net to approximate the Q(s,a) values. When choosing an action, you feed the current state into the neural net and get back its approximate value for each action, then pick an action according to your algorithm (such as the epsilon-greedy method). As your agent interacts with the environment, you train and update the neural net with the new data to improve the function approximation.
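To connect this back to the question about writing f_i(state, action): the features do not need to predict what the action will do to the observation. One common construction (an illustrative choice, not the only one) is to pair the raw observation with a one-hot encoding of the action, so each action gets its own block of weights. A rough numpy sketch for a cart-pole-like setup (4 state features, 2 actions; the learning rate, discount, and transition values below are placeholders):

```python
import numpy as np

N_FEATURES, N_ACTIONS = 4, 2          # cart-pole: 4 observation values, 2 actions
ALPHA, GAMMA = 0.01, 0.99             # learning rate and discount (illustrative values)

def features(state, action):
    """f(s, a): copy the state vector into the block belonging to `action`.

    The action doesn't predict anything about the next state; it just selects
    which block of weights the current observation is multiplied with.
    """
    f = np.zeros(N_FEATURES * N_ACTIONS)
    f[action * N_FEATURES:(action + 1) * N_FEATURES] = state
    return f

w = np.zeros(N_FEATURES * N_ACTIONS)  # the weights w_i

def q_value(state, action):
    return w @ features(state, action)

def update(state, action, reward, next_state, done):
    """One Q-learning step with linear function approximation."""
    global w
    target = reward if done else reward + GAMMA * max(
        q_value(next_state, a) for a in range(N_ACTIONS))
    td_error = target - q_value(state, action)
    w = w + ALPHA * td_error * features(state, action)

# Placeholder transition just to show the call pattern:
s  = np.array([0.0, 0.1, 0.02, -0.1])    # cart pos, cart vel, pole angle, pole ang. vel
s2 = np.array([0.01, 0.12, 0.018, -0.09])
update(s, action=1, reward=1.0, next_state=s2, done=False)
```

A neural net works the same way, just with the linear weights replaced by a network whose output is one Q estimate per action.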
Source: https://stackoverflow.com/questions/53077399/when-using-functional-approximation-in-reinforcement-learning-how-does-one-selec