I need some ideas on how to write a cross-platform C++ implementation of a few parallelizable problems so that I can take advantage of SIMD (SSE, SPU, etc.) if available. As well
You might want to look at the source for the MacSTL library for some ideas in this area: www.pixelglow.com/macstl/
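
To give a rough idea of the general approach, here is a minimal sketch of a thin wrapper type that uses SSE intrinsics when the compiler reports SSE support and falls back to plain scalar code otherwise. This is not MacSTL's API: the names float4, multiply_add and SKETCH_HAVE_SSE are made up for illustration, and the preprocessor check is an assumption that covers GCC/Clang and MSVC on x86 only. Other targets (AltiVec, SPU) would need their own branch.

```cpp
#include <cstddef>

// Assumed detection: GCC/Clang define __SSE__; MSVC defines _M_X64 or _M_IX86_FP.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define SKETCH_HAVE_SSE 1
#else
  #define SKETCH_HAVE_SSE 0
#endif

// Hypothetical wrapper around four floats. Same interface on every platform;
// only the storage and the operator bodies change.
struct float4 {
#if SKETCH_HAVE_SSE
    __m128 v;

    // Unaligned load/store so callers need not guarantee 16-byte alignment.
    static float4 load(const float* p) { float4 r; r.v = _mm_loadu_ps(p); return r; }
    void store(float* p) const { _mm_storeu_ps(p, v); }

    friend float4 operator+(float4 a, float4 b) { float4 r; r.v = _mm_add_ps(a.v, b.v); return r; }
    friend float4 operator*(float4 a, float4 b) { float4 r; r.v = _mm_mul_ps(a.v, b.v); return r; }
#else
    float v[4];

    // Scalar fallback: same semantics, one lane at a time.
    static float4 load(const float* p) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const {
        for (int i = 0; i < 4; ++i) p[i] = v[i];
    }

    friend float4 operator+(float4 a, float4 b) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
    friend float4 operator*(float4 a, float4 b) {
        float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }
#endif
};

// Computes out[i] = a[i] * b[i] + c[i] for i in [0, n).
// The main loop handles four elements per step; the tail loop handles the rest.
void multiply_add(const float* a, const float* b, const float* c,
                  float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float4 r = float4::load(a + i) * float4::load(b + i) + float4::load(c + i);
        r.store(out + i);
    }
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

Calling code only ever sees float4 and multiply_add, so the algorithm is written once and the platform-specific part stays inside the wrapper. Adding another instruction set would mean adding another #if branch to the struct, not touching the loops.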