In reinforcement learning, what is the difference between policy iteration and value iteration?
As much as I understand, in value iteration, you use t
The basic difference is this:
In policy iteration, you start with an arbitrary policy and compute the value function corresponding to it (policy evaluation), then derive a new, improved policy that is greedy with respect to that value function (policy improvement), and repeat. This process converges to the optimal policy.
In value iteration, you start with an arbitrary value function and repeatedly improve it in an iterative process until it converges to the optimal value function; you then derive the optimal policy from that optimal value function.
Policy iteration works on the principle "policy evaluation -> policy improvement".
Value iteration works on the principle "optimal value function -> optimal policy".
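The two loops above can be sketched side by side. Below is a minimal illustration on a made-up 2-state, 2-action MDP (the transition matrix `P`, reward matrix `R`, and discount `gamma` are arbitrary values chosen for the example, not from the question): value iteration sweeps the Bellman optimality backup until the values converge, while policy iteration alternates exact policy evaluation (a linear solve) with greedy improvement. Both recover the same optimal policy.

```python
import numpy as np

# A tiny hypothetical MDP, invented for illustration.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9
n_states, n_actions = R.shape

def value_iteration(tol=1e-8):
    # Repeatedly apply the Bellman optimality backup until the value
    # function stops changing, then read off the greedy (optimal) policy.
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s,a,s'] V[s']
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_iteration():
    # Alternate policy evaluation (solve the linear system for V^pi exactly)
    # with greedy policy improvement, until the policy stops changing.
    policy = np.zeros(n_states, dtype=int)
    while True:
        P_pi = P[np.arange(n_states), policy]   # transitions under the current policy
        R_pi = R[np.arange(n_states), policy]   # rewards under the current policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)  # policy evaluation
        new_policy = (R + gamma * P @ V).argmax(axis=1)             # policy improvement
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration()
print(pi_vi, pi_pi)   # both methods yield the same optimal policy
```

Note the structural difference this makes concrete: policy iteration does expensive but exact evaluation of each intermediate policy, so it typically needs few outer iterations, while value iteration never evaluates a policy exactly and instead makes many cheap backup sweeps.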