I've written a container for a very simple piece of data that needs to be synchronized across threads. I want the top performance. I don't want to use locks.
This code is completely broken.
The only reason this appears to work is that current compilers aren't very aggressive about reordering across atomic operations, and x86 processors provide fairly strong memory-ordering guarantees.
The first problem is that, without synchronization, there is no guarantee that a client of this data structure will even see the fields of the node object as initialized. The next issue is that, without synchronization, the push operation can read arbitrarily old values for the head's tag.
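To make the first problem concrete, here is a minimal sketch of how a node is normally published safely: the writer fills in the node and then installs it with a release compare-exchange, and the reader loads the head with acquire. This is not your container. `Node`, `PushOnlyStack` and `demo_sum` are hypothetical names, the stack is push-only (so there is no pop and no reclamation), and it does not model the tag on the head at all.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical node type; the layout of the original container is not shown.
struct Node {
    int value;
    Node* next;
};

// Push-only stack: enough to show the release/acquire pairing.
class PushOnlyStack {
public:
    void push(int v) {
        // The first read of head_ may be stale; that is fine here because
        // a failed compare_exchange_weak writes the current head into n->next.
        Node* n = new Node{v, head_.load(std::memory_order_relaxed)};
        // Release on success: the writes to n->value and n->next happen
        // before any acquire load that observes n as the head.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    long sum() const {
        long total = 0;
        // Acquire pairs with the release in push. Every successful push is a
        // read-modify-write on head_, so the pushes form one release sequence
        // and the fields of older nodes are visible as well.
        for (Node* p = head_.load(std::memory_order_acquire); p; p = p->next)
            total += p->value;
        return total;
    }

    ~PushOnlyStack() {
        Node* p = head_.load(std::memory_order_relaxed);
        while (p) {
            Node* next = p->next;
            delete p;
            p = next;
        }
    }

private:
    std::atomic<Node*> head_{nullptr};
};

// Hypothetical driver: several threads push 1..per_thread, then we sum.
long demo_sum(int threads, int per_thread) {
    PushOnlyStack s;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&s, per_thread] {
            for (int i = 1; i <= per_thread; ++i) s.push(i);
        });
    for (auto& w : workers) w.join();
    return s.sum();
}
```

If the successful compare-exchange were relaxed instead of release, nothing would order the initialization of `value` and `next` before the pointer becomes visible, which is exactly the first problem described above.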
We have developed a tool, CDSChecker, that simulates most behaviors that the memory model allows. It is open source and free. Run it on your data structure to see some interesting executions.
Proving anything about code that uses relaxed atomics is a big challenge at this point. Most proof methods break down because they are typically inductive in nature, and you don't have an order to induct on. So you run into out-of-thin-air read issues...
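For readers who have not met the term, the usual illustration of an out-of-thin-air read is the pair of threads below, which is essentially the example the C++ standard gives in its discussion of relaxed ordering. The names `oota_x`, `oota_y` and `run_once` are mine. The formal model does not rule out both loads returning, say, 42, with each store justifying the other load; the standard only recommends that implementations not do this, and on real hardware you should expect to see zeros.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical shared variables, both initially zero.
std::atomic<int> oota_x{0};
std::atomic<int> oota_y{0};

// Each thread copies one variable into the other using only relaxed
// operations, so there is no ordering between the two threads.
std::pair<int, int> run_once() {
    oota_x.store(0, std::memory_order_relaxed);
    oota_y.store(0, std::memory_order_relaxed);
    int r1 = 0, r2 = 0;
    std::thread a([&] {
        r1 = oota_x.load(std::memory_order_relaxed);
        oota_y.store(r1, std::memory_order_relaxed);
    });
    std::thread b([&] {
        r2 = oota_y.load(std::memory_order_relaxed);
        oota_x.store(r2, std::memory_order_relaxed);
    });
    a.join();
    b.join();
    return {r1, r2};
}
```

Nothing in the program ever writes a nonzero value, yet a circular justification is hard to exclude formally, and that circularity is what leaves an inductive proof with no order to induct on.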