How is Q-learning different from value iteration in reinforcement learning?
I know that Q-learning is model-free and that its training samples are transitions (s, a, s', r).
I don't think the accepted answer captures the essence of the difference. To quote the latest edition of Sutton and Barto's book:
" Having q∗ makes choosing optimal actions even easier. With q∗, the agent does not even have to do a one-step-ahead search: for any state s, it can simply find any action that maximizes q∗(s; a). The action-value function effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state{action pair. Hence, at the cost of representing a function of state{action pairs, instead of just of states, the optimal action value function allows optimal actions to be selected without having to know anything about possible successor states and their values, that is, without having to know anything about the environment’s dynamics. "
Usually, in real problems, the agent doesn't know the world (i.e. the transition) dynamics, but we definitely know the rewards, because those are what the environment gives back during the interaction, and the reward function is something we define ourselves.
The real difference between Q-learning and plain value iteration is this: after you have computed V*, you still need to do a one-step look-ahead over successor states to identify the optimal action for each state, and that look-ahead requires the transition dynamics. But if you have Q*, the optimal policy is simply to pick the action a that maximizes Q*(s, a).
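Here is a minimal tabular NumPy sketch of that contrast; the MDP arrays P and R, the tiny state/action counts, and the hyperparameters (gamma, alpha, epsilon, iteration counts) are all made-up for illustration. Value iteration uses the model P both in its backup and in the final greedy look-ahead, while Q-learning only consumes sampled transitions (s, a, r, s') and reads the greedy action straight off the Q-table.

```python
import numpy as np

# Made-up tabular MDP for illustration:
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S')
R = rng.normal(size=(n_states, n_actions))

# Value iteration: needs the model (P, R) for the backup AND for the
# one-step look-ahead that extracts the greedy policy from V*.
V = np.zeros(n_states)
for _ in range(1000):
    V = np.max(R + gamma * P @ V, axis=1)             # Bellman optimality backup
policy_from_V = np.argmax(R + gamma * P @ V, axis=1)  # look-ahead uses P again

# Q-learning: the update only touches sampled transitions (s, a, r, s');
# P and R appear below solely to simulate the environment's responses.
Q = np.zeros((n_states, n_actions))
alpha, epsilon, s = 0.1, 0.1, 0
for _ in range(50_000):
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = rng.choice(n_states, p=P[s, a])           # environment samples s'
    r = R[s, a]                                        # environment emits reward
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])  # model-free update
    s = s_next
policy_from_Q = np.argmax(Q, axis=1)                   # no look-ahead, no model

print(policy_from_V, policy_from_Q)
```

Note that extracting `policy_from_Q` is a pure table lookup, whereas `policy_from_V` has to re-evaluate the expectation over successors through P, which is exactly the extra knowledge a model-free agent doesn't have.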