I need some idea how to write a C++ cross platform implementation of a few parallelizable problems in a way so I can take advantage of SIMD (SSE, SPU, etc) if available. As well
You might want to take a glance at my attempt at SIMD/non-SIMD:
vrep, a templated base class with specializations for SIMD (note how it distinguishes between floats-only SSE, and SSE2, which introduced integer vectors.).
More useful v4f, v4i etc classes (subclassed via intermediate v4).
Of course it's far more geared towards 4-element vectors for rgba/xyz type calculations than SoA, so will completely run out of steam when 8-way AVX comes along, but the general principles might be useful.