I am new to programming in general so please keep that in mind when you answer my question.
I have a program that takes a large 3D array (1 billion elements) and sums up
Eliminate False Sharing
This is where multiple cores block each other while trying to read or update different memory addresses that share the same cache block (cache line). Cache coherency works per block, so only one core can write to a given block at a time, even when the threads are touching different variables inside it.
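To make that concrete, here is a minimal sketch of one common way to avoid false sharing when several threads each keep a running sum. It is not taken from the question's code: the names `PaddedSum`, `parallel_sum` and `kCacheLine` are made up for illustration, the 64-byte line size is an assumption (typical on x86-64, but check your hardware), and it assumes a C++17 compiler so that `std::vector` honours the over-aligned type.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Each per-thread accumulator is aligned (and therefore padded) to a full
// cache line, so neighbouring slots never share a line.
struct alignas(kCacheLine) PaddedSum {
    std::int64_t value = 0;
};

// Hypothetical helper: sums `data` using `num_threads` worker threads.
std::int64_t parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<PaddedSum> partial(num_threads);
    std::vector<std::thread> workers;
    // Split the input into contiguous chunks, one per thread.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            std::int64_t local = 0;  // accumulate in a local variable
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;  // one write to the shared slot
        });
    }
    for (auto& w : workers) w.join();
    // Combine the per-thread results on the calling thread.
    std::int64_t total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

Two things are doing the work here: each thread sums into a local variable and writes its shared slot only once, and the slots themselves are padded so no two threads ever write to the same cache line. Calling `parallel_sum(values, 4)` on a `std::vector<int>` should give the same result as a plain loop; whether it is actually faster depends on the data size and the machine, so measure before and after.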
Herb Sutter has a very good article on False Sharing, how to discover it and how to avoid it in your parallel algorithms.
Obviously he has loads of other excellent articles on concurrent programming too; see his blog.