I don't want to optimize anything, I swear, I just want to ask this question out of curiosity. I know that on most hardware there's an assembly instruction for bit-shifts (e.g. shl, shr). But does it matter how many bits you shift, i.e. is x << 1 any faster or slower than x << 10 on any CPU?
There are many cases here:

1. Many high-speed MPUs have a barrel shifter, a multiplexer-like electronic circuit that performs any shift in constant time.

2. If the MPU has only a 1-bit shift, x << 10 would normally be slower, since it is mostly done as 10 single-bit shifts, or as a byte copy plus 2 shifts.

3. But there is a known common case where x << 10 can even be faster than x << 1. If x is 16 bits, only its lower 6 bits matter (all the others will be shifted out), so the MPU needs to load only the lower byte, making a single access cycle to 8-bit memory, while x << 1 needs two access cycles. If an access cycle is slower than a shift (plus clearing the lower byte), x << 10 will be faster. This may apply to microcontrollers with fast on-board program ROM but slow external data RAM. (A quick C check of the bits-that-matter claim follows this list.)

4. In addition to case 3, the compiler may track the number of significant bits in x << 10 and narrow further operations, e.g. replacing a 16x16 multiplication with a 16x8 one (since the lower byte is always zero).

Note that some microcontrollers have no shift-left instruction at all; they use add x,x instead.
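A minimal self-contained C sketch of case 3's claim, verifying that only the low 6 bits of a 16-bit x can affect x << 10:

#include <assert.h>
#include <stdint.h>

int main(void) {
    for (uint32_t x = 0; x <= 0xFFFF; x++) {
        /* Bits 6..15 are shifted out of a 16-bit result, so masking them
           off before the shift must not change the answer. */
        assert((uint16_t)(x << 10) == (uint16_t)((x & 0x3F) << 10));
    }
    return 0;
}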
As always, it depends on the surrounding code context: e.g. are you using x<<1 as an array index, or adding it to something else? In either case, small shift counts (1 or 2) can often be optimized even more than a plain shift would be. Not to mention the whole throughput vs. latency vs. front-end-bottlenecks tradeoff: performance of a tiny fragment is not one-dimensional.
A hardware shift instruction is not a compiler's only option for compiling x<<1, but the other answers mostly assume that it is.
x << 1 is exactly equivalent to x+x for unsigned integers, and for 2's-complement signed integers. Compilers always know what hardware they're targeting while they're compiling, so they can take advantage of tricks like this.
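A quick sketch of that equivalence in C (using unsigned so the wraparound case is well-defined):

#include <assert.h>
#include <stdint.h>

int main(void) {
    /* x << 1 and x + x produce identical bit patterns, even when the
       top bit is shifted out. */
    uint32_t tests[] = { 0u, 1u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        assert((tests[i] << 1) == tests[i] + tests[i]);
    return 0;
}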
On Intel Haswell, add has 4-per-clock throughput, but shl with an immediate count has only 2-per-clock throughput. (See http://agner.org/optimize/ for instruction tables, and other links in the x86 tag wiki.) SIMD vector shifts are 1 per clock (2 in Skylake), but SIMD vector integer adds are 2 per clock (3 in Skylake). Latency is the same, though: 1 cycle.
There's also a special shift-by-one encoding of shl where the count is implicit in the opcode. 8086 didn't have immediate-count shifts, only shift-by-one and shift-by-cl. This is mostly relevant for right shifts, because for left shifts you can just add, unless you're shifting a memory operand (though if the value is needed later, it's better to load it into a register first). Anyway, shl eax,1 or add eax,eax is one byte shorter than shl eax,10, and code size can directly (decode / front-end bottlenecks) or indirectly (L1I code-cache misses) affect performance.
More generally, small shift counts can sometimes be optimized into a scaled index in an addressing mode on x86. Most other architectures in common use these days are RISC and don't have scaled-index addressing modes, but x86 is a common enough architecture for this to be worth mentioning. (E.g. if you're indexing an array of 4-byte elements, there's room to absorb one extra bit of shift into the scale factor for int arr[]; arr[x<<1].)
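For instance, a sketch of that folding (the exact asm depends on the compiler; check Godbolt for yours):

int load2x(const int *arr, int x) {
    /* The <<1 can fold into the addressing mode: 4-byte elements times 2
       gives a scale factor of 8, e.g. something like
       mov eax, [rdi + rsi*8] after sign-extending x. */
    return arr[x << 1];
}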
Needing to copy+shift is common in situations where the original value of x is still needed, but most x86 integer instructions operate in-place. (The destination is one of the sources for instructions like add or shl.) The x86-64 System V calling convention passes args in registers, with the first arg in edi and the return value in eax, so a function that returns x<<10 also makes the compiler emit copy+shift code.
The LEA instruction lets you shift-and-add, with a shift count of 0 to 3, because it uses addressing-mode machine encoding. It puts the result in a separate register.
gcc and clang both optimize these functions the same way, as you can see on the Godbolt compiler explorer:
int shl1(int x) { return x<<1; }
    lea eax, [rdi+rdi]     # 1 cycle latency, 1 uop
    ret

int shl2(int x) { return x<<2; }
    lea eax, [4*rdi]       # longer encoding: needs a disp32 of 0, because there's no base register, only a scaled index
    ret

int times5(int x) { return x * 5; }
    lea eax, [rdi + 4*rdi] # shift-and-add in one instruction
    ret

int shl10(int x) { return x<<10; }
    mov eax, edi           # 1 uop, 0 or 1 cycle latency
    shl eax, 10            # 1 uop, 1 cycle latency
    ret
LEA with 2 components has 1-cycle latency and 2-per-clock throughput on recent Intel and AMD CPUs (Sandybridge-family and Bulldozer/Ryzen). On Intel, a 3-component LEA like lea eax, [rdi + rsi + 123] has only 1-per-clock throughput and 3-cycle latency. (Related: Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? goes into this in detail.)
Anyway, copy+shift by 10 needs a separate mov instruction. The mov might be zero-latency on many recent CPUs thanks to mov-elimination, but it still takes front-end bandwidth and code size. (Can x86's MOV really be "free"? Why can't I reproduce this at all?)
Also related: How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.
The compiler is also free to transform the surrounding code so there isn't an actual shift, or it's combined with other operations.
For example, if (x<<1) { } could use an and to check all bits except the high bit. On x86, you'd use a test instruction, like test eax, 0x7fffffff / jz .false, instead of shl eax,1 / jz. This optimization works for any shift count, and it also works on machines where large-count shifts are slow (like Pentium 4) or non-existent (some microcontrollers).
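In C terms, the transformation amounts to replacing the shift with a mask (a sketch, assuming 32-bit unsigned):

/* Both return nonzero iff any bit below the sign bit is set: bit 31 is
   shifted out, so only bits 0..30 can make x << 1 nonzero. */
int via_shift(unsigned x) { return (x << 1) != 0; }
int via_mask (unsigned x) { return (x & 0x7fffffffu) != 0; }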
Many ISAs have bit-manipulation instructions beyond just shifting. E.g. PowerPC has many bit-field extract/insert instructions, and ARM allows a shifted source operand as part of most other instructions (so shift/rotate instructions are just a special form of move, using a shifted source).
Remember, C is not assembly language. Always look at optimized compiler output when you're tuning your source code to compile efficiently.
It is conceivable that, on an 8-bit processor, x<<1 could actually be much slower than x<<10 for a 16-bit value. For example, a reasonable translation of x<<1 might be:
byte1 = (byte1 << 1) | (byte2 >> 7)   // byte1 is the high byte: shift it, carry in byte2's top bit
byte2 = (byte2 << 1)                  // byte2 is the low byte
whereas x<<10 would be simpler:
byte1 = (byte2 << 2)                  // only the low 6 bits of byte2 survive
byte2 = 0                             // the low byte of the result is always zero
Notice how x<<1 shifts more often, and even farther, than x<<10. Furthermore, the result of x<<10 doesn't depend on the content of byte1 at all, which could speed up the operation further.
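Those two translations can be sanity-checked in C (a sketch, with byte1 as the high byte and byte2 as the low byte):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint16_t x = 0x1234;
    uint8_t byte1 = x >> 8;    /* high byte */
    uint8_t byte2 = x & 0xFF;  /* low byte  */

    /* x << 1: shift both bytes, carrying byte2's top bit into byte1. */
    uint8_t hi = (uint8_t)((byte1 << 1) | (byte2 >> 7));
    uint8_t lo = (uint8_t)(byte2 << 1);
    assert(((hi << 8) | lo) == (uint16_t)(x << 1));

    /* x << 10: only byte2's low 6 bits survive, landing in the high byte. */
    hi = (uint8_t)(byte2 << 2);
    lo = 0;
    assert(((hi << 8) | lo) == (uint16_t)(x << 10));
    return 0;
}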