Here are two ways to set an individual bit in C on x86-64:
inline void SetBitC(long *array, int bit) {
//Pure C version
*array |= 1<
I think you're asking a lot of your optimizer.
You might be able to help it out a little by doing a `register long z = 1L << bit;`, then OR-ing that with your array.
However, I assume that by "90% more time" you mean that the C version takes 10 cycles and the asm version takes 5 cycles, right? How does the performance compare at -O2 or -O1?
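For reference, here is a minimal sketch of the "build the mask in a long first" suggestion. The function name `SetBitViaLong` is made up for illustration, and the code assumes a 64-bit `long` (true on x86-64 Linux, not on 64-bit Windows). Whether it actually helps depends on your compiler and flags, so treat it as something to benchmark rather than a guaranteed win.

```c
#include <stddef.h>

/* Hypothetical variant of SetBitC: compute the mask as a long before OR-ing.
 * Using 1L (not 1) keeps the shift in the width of long, so bits 32..63
 * are reachable when long is 64 bits wide. */
static inline void SetBitViaLong(long *array, int bit) {
    /* 'register' is only a hint; modern compilers are free to ignore it. */
    register long z = 1L << bit;
    *array |= z;
}
```

Calling `SetBitViaLong(&word, 40)` on a zeroed `long word` should leave only bit 40 set, which a plain `1 << bit` with an `int` constant cannot do portably.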