I am new to programming in general so please keep that in mind when you answer my question.
I have a program that takes a large 3D array (1 billion elements) and sums up
Your computer system typically has several components that limit its overall performance. Which one is the limiting factor depends on the concrete situation. Usually one of the following is the cause of performance problems.
Disk I/O bandwidth: In most enterprise applications the sheer size of the data processed requires it to be stored in some database. Accessing this data may be slowed down both by the maximum transfer speed and, very often, by a large number of small disk accesses reading blocks here and there. Then the latency of the disk heads moving around, and even the time the disk needs for a full rotation, may limit your application. A long time ago I had a real problem with an expensive SUN E430 installation that was outperformed by my small NeXTstation... It was the constant fsync()ing of my database, which was slowed down by disks that did not cache write accesses (for good reason). Normally you can speed up your system by adding disks to get more I/O operations per second. Dedicating specific drives to specific tasks may do even better in some cases.
Network latency: nearly everything said above about disks and application speed applies equally to network I/O.
RAM: If your RAM is not big enough to hold your complete application image, parts of it have to be paged out to disk. Then the disk I/O slowdown bites you again.
CPU processing speed (integer or floating point): CPU processing power is the next limiting factor for CPU-intensive tasks. A CPU has a physical speed limit that cannot be exceeded. The only way to go faster is to add more CPUs or cores.
These limits may help you find an answer to your specific problem.
Do you simply need more processing power, and does your system have more than one CPU or core? In that case multithreading will improve your performance.
Do you observe significant network or disk latency? If so, your valuable CPU may be throwing away cycles waiting for slow I/O. If more than one thread is active, another thread might find all the data it needs already in memory and pick up those otherwise wasted CPU cycles.
Therefore you need to observe your existing application. Try to estimate the memory bandwidth of the data being shuffled around. If the application is active on one CPU at below 100%, you might have reached the memory bandwidth limit. In that case, additional threads will do you no good, because they do not give you more bandwidth from memory.
If the CPU is at 100%, give it a try, but have a look at the algorithms first. Multithreading adds overhead for synchronization (and complexity, tons of complexity) that might slightly reduce the memory bandwidth. Prefer algorithms that can be implemented without fine-grained synchronization.
If you see I/O wait times, think about clever partitioning or caching first, and then about threading. There is a reason why GNU make supported parallel builds back in the '90s :-)
The problem domain you've described leads me to have a look at clever algorithms first. Use sequential read/write operations on main memory as much as possible to support the CPU and memory subsystems. Keep operations "local" and data structures as small and optimized as possible to reduce the amount of memory that has to be shuffled around before switching to a second core.