I'm looking to calculate highly parallelized trig functions (in blocks of about 1024), and I'd like to take advantage of at least some of the parallelism that modern architectur
Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.