As someone in the world of HPC who came from the world of enterprise web development, I\'m always curious to see how developers back in the \"real world\" are taking advanta
I'm taking advantage of multicore using C, PThreads, and a home brew implementation of Communicating Sequential Processes on an OpenVPX platform with Linux using the PREEMPT_RT patch set's scheduler. It all adds up to nearly 100% CPU utilisation across multiple OS instances with no CPU time used for data exchange between processor cards in the OpenVPX chassis, and very low latency too. Also using sFPDP to join multiple OpenVPX chassis together into a single machine. Am not using Xeon's internal DMA so as to relieve memory pressure inside CPUs (DMA still uses memory bandwidth at the expense of the CPU cores). Instead we're leaving data in place and passing ownership of it around in a CSP way (so not unlike the philosophy of .NET's task parallel data flow library).
1) Software Roadmap - we have pressure to maximise the use real estate and available power. Making the very most of the latest hardware is essential
2) Software domain - effectively Scientific Computing
3) What we're doing with existing code? Constantly breaking it apart and redistributing parts of it across threads so that each core is maxed out doing the most it possibly can without breaking out real time requirement. New hardware means quite a lot of re-thinking (faster cores can do more in the given time, don't want them to be under utilised). Not as bad as it sounds - the core routines are very modular so easily assembled into thread-sized lumps. Although we planned on taking control of thread affinity away from Linux, we've not yet managed to extract significant extra performance by doing so. Linux is pretty good at getting data and code in more or less the same place.
4) In effect already there - total machine already adds up to thousands of cores
5) Parallel computing is essential - it's a MISD system.
If that sounds like a lot of work, it is. some jobs require going whole hog on making the absolute most of available hardware and eschewing almost everything that is high level. We're finding that the total machine performance is a function of CPU memory bandwidth, not CPU core speed, L1/L2/L3 cache size.