Here is a succinct answer: a policy is the 'thinking' of the agent. It is a mapping from states to actions: when you are in some state s, which action a should the agent take now? You can think of a policy as a lookup table:
state    action    probability/'goodness' of taking the action
1        1         0.6
1        2         0.4
2        1         0.3
2        2         0.7
If you are in state 1, you'd (assuming a greedy strategy) pick action 1, since 0.6 > 0.4. If you are in state 2, you'd pick action 2, since 0.7 > 0.3.
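
To make the lookup-table picture concrete, here is a minimal Python sketch of the table above. The names policy_table, greedy_action and sample_action are hypothetical, made up for illustration. The sampling function additionally assumes the numbers are read as probabilities, which is only one of the two readings given in the table header.

```python
import random

# Hypothetical lookup table: state -> {action: probability/'goodness'}
policy_table = {
    1: {1: 0.6, 2: 0.4},
    2: {1: 0.3, 2: 0.7},
}

def greedy_action(table, state):
    """Return the action with the highest value in the given state."""
    action_values = table[state]
    return max(action_values, key=action_values.get)

def sample_action(table, state):
    """Draw an action at random, treating the values as probabilities (an assumption)."""
    action_values = table[state]
    actions = list(action_values.keys())
    weights = list(action_values.values())
    return random.choices(actions, weights=weights, k=1)[0]

for state in policy_table:
    print(f"state {state} -> greedy action {greedy_action(policy_table, state)}")
# prints:
# state 1 -> greedy action 1
# state 2 -> greedy action 2
```

The greedy version always returns the same action for a given state, while the sampled version would pick action 1 in state 1 roughly 60% of the time.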