In plain words, in the simplest case, a policy π
is a function that takes as input a state s
and returns an action a
. That is: π(s) → a
In this way, the policy is typically used by the agent to decide what action a
should be performed when it is in a given state s
.
Sometimes, the policy can be stochastic instead of deterministic. In such a case, instead of returning a unique action a
, the policy returns a probability distribution over a set of actions.
In general, the goal of any RL algorithm is to learn an optimal policy that achieve a specific goal.