Question:
I am trying to dive deep into the volatile keyword in Java and set up two testing environments. I believe both of them are x86_64 and use HotSpot.

Environment 1: Java version 1.8.0_232, CPU: AMD Ryzen 7 (8 cores)
Environment 2: Java version 1.8.0_231, CPU: Intel i7
Code is here:
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class Test {
    private boolean flag = true; // left non-volatile intentionally
    private volatile int dummyVolatile = 1;

    public static void main(String[] args) throws Exception {
        Test t = new Test();

        // Grab the Unsafe singleton via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        Thread t1 = new Thread(() -> {
            while (t.flag) {
                //int b = t.dummyVolatile;
                //unsafe.loadFence();
                //unsafe.storeFence();
                //unsafe.fullFence();
            }
            System.out.println("Finished!");
        });

        Thread t2 = new Thread(() -> {
            t.flag = false;
            unsafe.fullFence();
        });

        t1.start();
        Thread.sleep(1000);
        t2.start();
        t1.join();
    }
}
"Finished!" is never printed which does not make sense to me. I am expecting the fullFence
in thread 2 makes the flag = false
globally visible.
From my research, HotSpot uses lock/mfence to implement fullFence on x86. And according to Intel's instruction-set reference manual entry for mfence:

"This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction."
Even "worse", if I comment out fullFence
in thread 2 and un-comment any one of the xxxFence
in thread 1, the code prints out "Finished!" This makes even less sense, because at least lfence is "useless"/no-op in x86.
Maybe my source of information contains inaccuracy or i am misunderstanding something. Please help, thanks!
Answer 1:
It's not the runtime effect of the fence that matters, it's the compile-time effect of forcing the compiler to reload stuff.
Your t1 loop contains no volatile reads or anything else that could synchronize-with another thread, so there's no guarantee it will ever notice any changes to any variables. i.e. when JITing to asm, the compiler can make a loop that loads the value into a register once, instead of reloading it from memory every time. This is the kind of optimization you always want the compiler to be able to do for non-shared data, which is why the language has rules that let it do this when there's no possible synchronization.
And then of course the condition can get hoisted out of the loop. So with no barriers or anything, your reader loop can JIT into asm that implements this logic:
if(t.flag) {
for(;;){} // infinite loop
}
Besides ordering, the other part of Java volatile is the assumption that other threads may change it asynchronously, so multiple reads can't be assumed to give the same value.
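For comparison, here's a minimal sketch (class and method names are mine, not from the question) of what happens when the flag is declared volatile: every read then has to observe the latest write and can't be hoisted out of the loop, so termination is guaranteed by the Java memory model rather than by luck:

```java
public class VolatileFlagDemo {
    // volatile: every read must observe the most recent write by any thread
    private static volatile boolean flag = true;

    // Returns true if the reader thread exited its loop within the timeout.
    static boolean run() throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (flag) {
                // busy-wait; the volatile read cannot be cached in a register
            }
        });
        reader.start();
        Thread.sleep(100);
        flag = false;      // volatile write: happens-before the reader's next volatile read
        reader.join(5000); // with volatile, this returns promptly
        return !reader.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run() ? "Finished!" : "reader stuck");
    }
}
```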
But unsafe.loadFence(); makes the JVM reload t.flag from (cache-coherent) memory every iteration. I don't know if this is required by the Java spec or merely an implementation detail that makes it happen to work.
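On Java 9+, the supported equivalent of unsafe.loadFence() is VarHandle.acquireFence(). A sketch of the reader-side version (class and method names are mine; note this is still a data race on a non-volatile field by the spec, so the termination is a HotSpot behaviour, not a guarantee):

```java
import java.lang.invoke.VarHandle;

public class AcquireFenceDemo {
    private static boolean flag = true; // intentionally non-volatile, as in the question

    // Returns true if the reader thread exited its loop within the timeout.
    static boolean run() throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (flag) {
                // acquire fence: compile-time barrier that forces the JIT
                // to reload flag from memory on each iteration
                VarHandle.acquireFence();
            }
        });
        reader.start();
        Thread.sleep(100);
        flag = false;      // plain store; the reader-side fence is what makes it noticed
        reader.join(5000);
        return !reader.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run() ? "Finished!" : "reader stuck");
    }
}
```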
If this were C++ with a non-atomic variable (which would be undefined behaviour in C++), you'd see exactly the same effect in a compiler like GCC. _mm_lfence() would also be a compile-time full barrier as well as emitting a useless lfence instruction, effectively telling the compiler that all memory might have changed and thus needs to be reloaded. So it can't reorder loads across it, or hoist them out of loops.
BTW, I wouldn't be so sure that unsafe.loadFence() even JITs to an lfence instruction on x86. It is useless for memory ordering (except for very obscure stuff like fencing NT loads from WC memory, e.g. copying from video RAM, which the JVM can assume isn't happening), so a JVM JITing for x86 could just treat it as a compile-time barrier. Just like what C++ compilers do for std::atomic_thread_fence(std::memory_order_acquire); - block compile-time reordering of loads across the barrier, but emit no asm instructions, because the asm memory model of the host running the JVM is already strong enough.
In thread 2, unsafe.fullFence(); is, I think, useless. It just makes that thread wait until earlier stores become globally visible before any later loads/stores can happen. t.flag = false; is a visible side effect that can't be optimized away, so it definitely happens in the JITed asm whether there's a barrier following it or not, even though it's not volatile. And it can't be delayed or merged with something else, because there's nothing else in the same thread.
Asm stores always become visible to other threads; the only question is whether the current thread waits for its store buffer to drain or not before doing more stuff (especially loads) in this thread, i.e. preventing all reordering, including StoreLoad. Java volatile does that, like C++ memory_order_seq_cst (by using a full barrier after every store), but without a barrier it's still a store, like C++ memory_order_relaxed. (Or when JITing x86 asm, loads/stores are actually as strong as acquire/release.)
Caches are coherent, and the store buffer always drains itself (committing to L1d cache) as fast as it can to make room for more stores to execute.
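That "stores always become visible, thanks to coherence" point can be demonstrated with VarHandle opaque mode (Java 9+; class and method names here are mine). Opaque accesses compile to plain loads/stores on x86, with no fences at all, yet the reader is still guaranteed to eventually see the store, because opaque reads can't be hoisted and cache coherence propagates the write:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class OpaqueVisibilityDemo {
    static boolean flag = true;
    static final VarHandle FLAG;

    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findStaticVarHandle(OpaqueVisibilityDemo.class, "flag", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns true if the reader thread exited its loop within the timeout.
    static boolean run() throws InterruptedException {
        Thread reader = new Thread(() -> {
            // opaque read: no ordering guarantees and no fence instructions,
            // but it can't be hoisted out of the loop and must eventually
            // observe the store (cache coherence)
            while ((boolean) FLAG.getOpaque()) {
            }
        });
        reader.start();
        Thread.sleep(100);
        FLAG.setOpaque(false); // plain coherent store; no barrier emitted on x86
        reader.join(5000);
        return !reader.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run() ? "Finished!" : "reader stuck");
    }
}
```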
Caveat: I don't know a lot of Java, and I don't know exactly how unsafe / undefined it is to assign a non-volatile field in one thread and read it in another with no synchronization. Based on the behaviour you're seeing, it sounds exactly like what you'd see in C++ for the same thing with non-atomic variables (with optimization enabled, like HotSpot always does).
(Based on @Margaret's comment, I updated with some guesswork about how I assume Java synchronization works. If I mis-stated anything, please edit or comment.)
In C++, data races on non-atomic vars are always Undefined Behaviour, but of course when compiling for real ISAs (which don't do hardware race-prevention) the results are sometimes what people wanted.
Source: https://stackoverflow.com/questions/59692233/why-does-unsafe-fullfence-not-ensuring-visibility-in-my-example