On a system with one CPU, Peterson's algorithm is guaranteed to work, because program's own behavior is observed in program order.
On systems with multiple CPUs the algorithm may fail to work because the program order of events occurring on one CPU may be perceived differently on another.
That may cause early entering into the critical section (while it's still in use by another thread) and early exiting from the critical section (when the work with the shared resource hasn't been finished yet).
For this reason one needs to ensure that the order of events occurring inside the CPU is seen the same outside. Memory barriers can ensure this serialization.