Assume there are two threads running on x86 CPU0 and CPU1, respectively. The thread running on CPU0 executes the following commands:
A=1
B=1
Cac
On x86, writes by a single processor are observed in the same order by all processors. There is no need for a fence in your example, nor in most normal programs on x86. Your program:
while(B==0); // wait for B == 1 to become globally observable
print A; // now, A will always be 1 here
What exactly happens in the cache is model-specific. All kinds of tricks and speculative behavior can occur in the cache, but the observable behavior always follows the architectural memory-ordering rules.
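For what it's worth, here is a minimal sketch of the same two-thread pattern in C11, assuming POSIX threads are available. The names `writer`, `reader`, and `run_once` are made up for illustration and are not from the question. The atomics are there because the hardware guarantee alone does not stop the compiler from reordering the two stores or hoisting the load of B out of the loop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int A;         /* plain data, published through B */
static atomic_int B;  /* flag the reader spins on */

/* CPU0 side: A=1 then B=1, with B stored as a release. */
static void *writer(void *arg) {
    (void)arg;
    A = 1;
    atomic_store_explicit(&B, 1, memory_order_release);
    return NULL;
}

/* CPU1 side: wait for B == 1, then read A. */
static void *reader(void *arg) {
    int *seen = arg;
    while (atomic_load_explicit(&B, memory_order_acquire) == 0)
        ;  /* spin until the store to B is visible */
    *seen = A;  /* ordered after the acquire load that saw B == 1 */
    return NULL;
}

/* Hypothetical helper: runs both threads once and returns the value
   of A that the reader observed. */
static int run_once(void) {
    pthread_t w, r;
    int seen = -1;
    A = 0;
    atomic_store(&B, 0);
    int rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

On x86, compilers typically turn both the release store and the acquire load into ordinary MOV instructions, so no fence instruction ends up in the generated code. The orderings only constrain what the compiler itself may do.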
See the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (System Programming Guide), Section 8.2.2, for the details on memory ordering.