I'm looking to calculate highly parallelized trig functions (in blocks of about 1024), and I'd like to take advantage of at least some of the parallelism that modern architectur
Since you said you were using GCC, it looks like there are some options:
That said, I'd probably look into GPGPU for a solution, maybe writing it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that look like they might make it easier.
Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.