What are bitwise shift (bit-shift) operators and how do they work?

生来不讨喜 2020-11-21 04:46

I've been attempting to learn C in my spare time, and other languages (C#, Java, etc.) have the same concept (and often the same operators) ...

What I'm wondering

11 Answers
  •  囚心锁ツ
    2020-11-21 04:57

    Let's say we have a single byte:

    00110110
    

    Applying a single left bitshift gets us:

    01101100
    

    The leftmost zero was shifted out of the byte, and a new zero was appended to the right end of the byte.

    The bits don't roll over; they are discarded. That means if you left shift a byte whose top bit is set, such as 10110110, and then right shift it, you won't get the same value back: you get 00110110.
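
    A minimal C sketch of the discarded-bit behavior, assuming 8-bit values held in uint8_t. The function names (shift_left_once, shift_right_once, check_discarded_bits) are made up for illustration and do not come from any library:

```c
#include <assert.h>
#include <stdint.h>

/* Shift an 8-bit value left by one. The operand is promoted to int,
   so the cast back to uint8_t is what drops the bit that left the byte. */
uint8_t shift_left_once(uint8_t b) {
    return (uint8_t)(b << 1);
}

/* Shift an 8-bit value right by one; a zero enters at the top. */
uint8_t shift_right_once(uint8_t b) {
    return (uint8_t)(b >> 1);
}

/* Usage: 0xB6 is 10110110 in binary. */
void check_discarded_bits(void) {
    uint8_t shifted = shift_left_once(0xB6);   /* 01101100 */
    uint8_t back = shift_right_once(shifted);  /* 00110110 */
    assert(shifted == 0x6C);
    assert(back == 0x36);  /* not 0xB6: the top bit is gone for good */
}
```

    The cast matters in C: without it, the intermediate int still holds the ninth bit, and the loss only happens when the result is stored back into a byte.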

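    Shifting left by N is equivalent to multiplying by 2^N, as long as no set bits are shifted out.

    Shifting right by N is equivalent to dividing by 2^N. For unsigned and non-negative values the result rounds toward zero. An arithmetic right shift of a negative two's complement value rounds toward negative infinity instead; only on a ones' complement machine would it round toward zero.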
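
    A small C sketch of the shift/multiply/divide equivalence, restricted to unsigned values, where it holds exactly. The helper names (mul_pow2, div_pow2, check_pow2_shifts) are hypothetical, and the sketch assumes n is less than 32:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 2^n with a left shift. Assumes n < 32; bits pushed past
   bit 31 are discarded, so the result is only exact if it fits. */
uint32_t mul_pow2(uint32_t x, unsigned n) {
    return x << n;
}

/* Divide by 2^n with a right shift. For unsigned values this is a
   logical shift and matches integer division exactly. */
uint32_t div_pow2(uint32_t x, unsigned n) {
    return x >> n;
}

void check_pow2_shifts(void) {
    assert(mul_pow2(5, 3) == 5 * 8);
    assert(div_pow2(41, 3) == 41 / 8);  /* the remainder is dropped */
    /* C division truncates toward zero (C99 and later). Right-shifting
       a negative signed value is implementation-defined in C, so it is
       deliberately not asserted here. */
    assert(-7 / 2 == -3);
}
```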

    Bitshifting can be used for insanely fast multiplication and division, provided you are working with a power of 2. Almost all low-level graphics routines use bitshifting.

    For example, way back in the olden days, we used Mode 13h (320x200, 256 colors) for games. In Mode 13h, the video memory was laid out sequentially, one byte per pixel. That meant that to calculate the location of a pixel, you would use the following math:

    memoryOffset = (row * 320) + column
    

    Now, back in that day and age, speed was critical, so we would use bitshifts to do this operation.

    However, 320 is not a power of two, so to get around this we have to find powers of two that add up to 320:

    (row * 320) = (row * 256) + (row * 64)
    

    Now we can convert that into left shifts:

    (row * 320) = (row << 8) + (row << 6)
    

    For a final result of:

    memoryOffset = ((row << 8) + (row << 6)) + column
    

    Now we get the same offset as before, except that instead of an expensive multiplication operation, we use two bitshifts. In x86 assembly it would be something like this (note: it's been forever since I've done assembly; editor's note: corrected a couple of mistakes and added a 32-bit example):

    mov ax, 320; 2 cycles
    mul word [row]; 22 CPU Cycles
    mov di,ax; 2 cycles
    add di, [column]; 2 cycles
    ; di = [row]*320 + [column]
    
    ; 16-bit addressing mode limitations:
    ; [di] is a valid addressing mode, but [ax] isn't, otherwise we could skip the last mov
    

    Total: 28 cycles on whatever ancient CPU had these timings.

    vs.

    mov ax, [row]; 2 cycles
    mov di, ax; 2
    shl ax, 6;  2
    shl di, 8;  2
    add di, ax; 2    (320 = 256+64)
    add di, [column]; 2
    ; di = [row]*(256+64) + [column]
    

    12 cycles on the same ancient CPU.

    Yes, we would work this hard to shave off 16 CPU cycles.
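
    The same offset calculation can be sketched in C to confirm that the two forms agree over the whole Mode 13h screen. The function names (offset_mul, offset_shift, check_offsets) are illustrative only, and the sketch assumes row is in 0..199 and column in 0..319, so the offset fits in 16 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Offset of a pixel in Mode 13h: 320 bytes per row, one byte per pixel. */
uint16_t offset_mul(uint16_t row, uint16_t column) {
    return (uint16_t)(row * 320u + column);
}

/* Same offset with the multiply replaced by two shifts: 320 = 256 + 64.
   The casts to unsigned keep the shifts out of signed arithmetic. */
uint16_t offset_shift(uint16_t row, uint16_t column) {
    return (uint16_t)(((unsigned)row << 8) + ((unsigned)row << 6) + column);
}

/* Compare both forms for every pixel on the 320x200 screen. */
void check_offsets(void) {
    for (uint16_t row = 0; row < 200; row++)
        for (uint16_t column = 0; column < 320; column++)
            assert(offset_mul(row, column) == offset_shift(row, column));
}
```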

    In 32 or 64-bit mode, both versions get a lot shorter and faster. Modern out-of-order execution CPUs like Intel Skylake (see http://agner.org/optimize/) have very fast hardware multiply (low latency and high throughput), so the gain is much smaller. AMD Bulldozer-family is a bit slower, especially for 64-bit multiply. On Intel CPUs, and AMD Ryzen, two shifts are slightly lower latency but more instructions than a multiply (which may lead to lower throughput):

    imul edi, [row], 320    ; 3 cycle latency from [row] being ready
    add  edi, [column]      ; 1 cycle latency (from [column] and edi being ready).
    ; edi = [row]*(256+64) + [column],  in 4 cycles from [row] being ready.
    

    vs.

    mov edi, [row]
    shl edi, 6               ; row*64.   1 cycle latency
    lea edi, [edi + edi*4]   ; row*(64 + 64*4).  1 cycle latency
    add edi, [column]        ; 1 cycle latency from edi and [column] both being ready
    ; edi = [row]*(256+64) + [column],  in 3 cycles from [row] being ready.
    

    Compilers will do this for you: GCC, Clang, and Microsoft Visual C++ all use shift+lea when compiling return 320*row + col; with optimization enabled.

    The most interesting thing to note here is that x86 has a shift-and-add instruction (LEA) that can do small left shifts and add at the same time, with the same performance as an add instruction. ARM is even more powerful: one operand of any instruction can be left or right shifted for free. So scaling by a compile-time constant that's known to be a power of 2 can be even more efficient than a multiply.
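
    To make the shl + lea sequence concrete, here is a hedged C sketch that computes row*320 + col in the same two steps (multiply by 64, then by 5 as t + t*4) and checks it against the plain expression. The names offset_shift_lea, offset_plain, and check_shift_lea are invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* row * 320 + col written the way the shl + lea sequence computes it:
   t = row * 64, then t + t*4 = row * 320. */
uint32_t offset_shift_lea(uint32_t row, uint32_t col) {
    uint32_t t = row << 6;   /* shl: row * 64 */
    t = t + (t << 2);        /* lea [t + t*4]: row * 64 * 5 */
    return t + col;
}

/* The plain form; an optimizing compiler is free to choose shifts itself. */
uint32_t offset_plain(uint32_t row, uint32_t col) {
    return 320 * row + col;
}

void check_shift_lea(void) {
    for (uint32_t row = 0; row < 1000; row++)
        assert(offset_shift_lea(row, 7) == offset_plain(row, 7));
}
```

    Since compilers already make this transformation, the plain form is usually the one to write in source code; the shifted form is shown here only to mirror the assembly.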


    OK, back in the modern days... something more useful now would be to use bitshifting to store two 8-bit values in a 16-bit integer. For example, in C#:

    byte Byte1 = 0xF0; // 11110000
    byte Byte2 = 0x0F; // 00001111
    
    // Byte2 goes into the high 8 bits, Byte1 stays in the low 8 bits.
    Int16 value = (Int16)((Byte2 << 8) | Byte1);
    
    // value = 0000111111110000
    
    

    In C++, compilers should do this for you if you used a struct with two 8-bit members, but in practice they don't always.
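
    For comparison, a minimal C sketch of packing two bytes into a 16-bit value and extracting them again with shifts and a mask. The function names (pack_bytes, high_byte, low_byte, check_pack_bytes) are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Put `high` in bits 15..8 and `low` in bits 7..0. */
uint16_t pack_bytes(uint8_t high, uint8_t low) {
    return (uint16_t)(((unsigned)high << 8) | low);
}

/* Recover the upper byte by shifting it back down. */
uint8_t high_byte(uint16_t v) {
    return (uint8_t)(v >> 8);
}

/* Recover the lower byte by masking off the upper one. */
uint8_t low_byte(uint16_t v) {
    return (uint8_t)(v & 0xFF);
}

void check_pack_bytes(void) {
    uint16_t v = pack_bytes(0x0F, 0xF0);  /* 00001111 11110000 */
    assert(v == 0x0FF0);
    assert(high_byte(v) == 0x0F);
    assert(low_byte(v) == 0xF0);
}
```

    Using unsigned types throughout avoids the sign-extension surprises that can appear when the high byte has its top bit set.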
