I am new to programming in general so please keep that in mind when you answer my question.
I have a program that takes a large 3D array (1 billion elements) and sums up
It's a matrix problem?
Both Intel and AMD have super-optimized libraries for all sorts of heavy math problems (Intel's MKL and AMD's ACML, for example). These libraries use threading, arrange the data for the best cache use, prefetch data into the cache, and use SSE vector instructions. Everything.
I believe you have to pay for the libraries, but they are well worth the money.
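If you want to see what "arranging the data for best cache use" means before reaching for a library, here is a minimal sketch. It assumes the goal is to add up every element, and that the array is stored as one contiguous row-major buffer of doubles; the question doesn't say either, so treat both as guesses. The function name `sum3d` and the dimension names `nx`, `ny`, `nz` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums every element of an nx * ny * nz array that is
// stored as one contiguous row-major buffer, where element (i, j, k) lives
// at index (i * ny + j) * nz + k.
double sum3d(const std::vector<double>& data,
             std::size_t nx, std::size_t ny, std::size_t nz) {
    // Guard against a buffer that does not match the stated dimensions.
    assert(data.size() == nx * ny * nz);

    double total = 0.0;
    for (std::size_t i = 0; i < nx; ++i) {
        for (std::size_t j = 0; j < ny; ++j) {
            // k is the fastest-varying index, so the inner loop walks
            // consecutive memory addresses instead of jumping around.
            const double* row = data.data() + (i * ny + j) * nz;
            for (std::size_t k = 0; k < nz; ++k) {
                total += row[k];
            }
        }
    }
    return total;
}
```

The point is the loop order: the innermost loop runs over the index that is adjacent in memory. Swapping the loops so that `i` is innermost would compute the same sum but touch memory in large strides, which is usually much slower on an array this size. A tuned library does this kind of thing for you, plus threading and vector instructions.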