Here are two ways to set an individual bit in C on x86-64:
inline void SetBitC(long *array, int bit) {
//Pure C version
*array |= 1<
This is a very very common operation on embedded systems which are generally resource constrained. 10 Cycles vs 5 Cycles is a nasty performance penalty on such systems. There are many cases when one wants to access IO ports or use 16 or 32 bit registers as Boolean bit flags to save memory.
The fact is that if(bit_flags& 1<<12)
is far more readable [and portable when implemented with a library] than the assembly equivalent. Likewise for IO_PINS|= 1<<5;
These are unfortunately many times slower, so the awkward asm macros live on.
In many ways, the goals of embedded and userspace applications are opposite. The responsiveness of external communications (to a User Interface or Machine Interface) are of minor importance, while ensuring a control loop (eqiv. to a micro-bechmark) completes in minimal time is absolutely critical and can make or break a chosen processor or control strategy.
Obviously if one can afford a multi-GHz cpu and all the associated peripherals, chip-sets, etc needed to support that, one does not really need worry about low-level optimisation at all. A 1000x slower micro-controller in a real-time control system means that saving clock cycles is 1000x more important.