Consider the following Java program (a complete runnable version follows, but the important part is in the snippet a little further below):
Nothing in your code actually writes to the volatile field, so its value can stay in each core's cache and the reads themselves remain cheap.
Using volatile prevents some compiler optimisations, and in a micro-benchmark that can show up as a large relative difference.
In the example above, the commented-out version is longer because the compiler has unrolled the loop, placing two iterations in each pass of the actual loop. This can almost double performance.
When using volatile, you can see there is no loop unrolling.
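For reference, here is a minimal sketch of the kind of comparison being discussed. It is not your original program: the class name `VolatileLoopSketch`, the two fields, and the iteration count are all made up for illustration. It just times the same loop reading a plain field and a volatile field.

```java
public class VolatileLoopSketch {
    // Hypothetical fields for illustration; neither is written after initialisation.
    static int plainStep = 1;
    static volatile int volatileStep = 1;

    // The compiler is free to hoist the read of plainStep and unroll this loop.
    static long sumPlain(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += plainStep;
        }
        return total;
    }

    // Each iteration must perform a real volatile read, which restricts
    // what the compiler may reorder or combine.
    static long sumVolatile(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += volatileStep;
        }
        return total;
    }

    public static void main(String[] args) {
        int iterations = 100_000_000; // arbitrary size chosen for the sketch

        // Warm up so both methods are JIT-compiled before timing.
        for (int round = 0; round < 5; round++) {
            sumPlain(iterations);
            sumVolatile(iterations);
        }

        long t0 = System.nanoTime();
        long plain = sumPlain(iterations);
        long t1 = System.nanoTime();
        long vol = sumVolatile(iterations);
        long t2 = System.nanoTime();

        // Print the sums too, so the loops cannot be discarded as dead code.
        System.out.printf("plain:    sum=%d, %d ms%n", plain, (t1 - t0) / 1_000_000);
        System.out.printf("volatile: sum=%d, %d ms%n", vol, (t2 - t1) / 1_000_000);
    }
}
```

The timings depend heavily on the JVM version and hardware, so treat any gap as indicative only. A harness such as JMH gives more trustworthy numbers. If you have the hsdis disassembler plugin installed, running with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` lets you check the generated code for unrolling yourself.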
BTW: You can remove a lot of the code in your example to make it easier to read. ;)