What's the fastest way to convert hex to integer in C++?

甜味超标 2021-02-01 08:52

I'm trying to convert a hex char to integer as fast as possible.

This is only one line: int x = atoi(hex.c_str());

Is there a faster way?

6 Answers
  •  清酒与你
    2021-02-01 09:32

    Well, that's a weird question. Converting a single hex char into an integer is so fast that it is really hard to tell which method is faster, because all of them are most likely faster than the code you write in order to use them =)

    I'll assume the following things:

    1. We have a modern x86(64) CPU.
    2. The input character's ASCII code is stored in a general purpose register, e.g. in eax.
    3. The output integer must be obtained in a general purpose register.
    4. The input character is guaranteed to be a valid hex digit (one of 16 cases).

    Solution

    Now here are several methods for solving the problem: the first is based on a lookup table, the next two on the ternary operator, and the last on bit operations:

    int hextoint_lut(char x) {
        static char lut[256] = {???};
        return lut[uint8_t(x)];
    }
    
    int hextoint_cond(char x) {
        uint32_t dig = x - '0';
        uint32_t alp = dig + ('0' - 'a' + 10);
        return dig <= 9U ? dig : alp;
    }
    int hextoint_cond2(char x) {
        uint32_t offset = (uint8_t(x) <= uint8_t('9') ? '0' : 'a' - 10);
        return uint8_t(x) - offset;
    }
    
    int hextoint_bit(char x) {
        int b = uint8_t(x);
        int mask = (('9' - b) >> 31);
        int offset = '0' + (mask & int('a' - '0' - 10));
        return b - offset;
    }
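
    The table contents are elided in hextoint_lut above ({???}). One way to fill such a table is sketched below. The names make_hex_lut, kHexLut and hextoint_lut_filled are made up for illustration, and the choices to accept uppercase digits and to store -1 for non-hex bytes are my assumptions, not something the original code specifies.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: builds the 256-entry table once at program startup.
// Assumption: non-hex bytes map to -1, and both 'a'-'f' and 'A'-'F' are accepted.
static std::array<signed char, 256> make_hex_lut() {
    std::array<signed char, 256> lut;
    lut.fill(-1);
    for (int d = 0; d < 10; ++d) {
        lut['0' + d] = static_cast<signed char>(d);
    }
    for (int d = 0; d < 6; ++d) {
        lut['a' + d] = static_cast<signed char>(10 + d);
        lut['A' + d] = static_cast<signed char>(10 + d);
    }
    return lut;
}

// Namespace-scope table, so the lookup itself has no per-call initialization check.
static const std::array<signed char, 256> kHexLut = make_hex_lut();

// Same shape as hextoint_lut: a single indexed load.
int hextoint_lut_filled(char x) {
    return kHexLut[static_cast<std::uint8_t>(x)];
}
```

    With this table, hextoint_lut_filled('f') gives 15 and hextoint_lut_filled('g') gives -1, so the caller can detect invalid input if it wants to.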
    

    Here are the corresponding assembly listings generated (only the relevant parts are shown):

    ;hextoint_lut;
    movsx   eax, BYTE PTR [rax+rcx]   ; just load the byte =)
    
    ;hextoint_cond;
    sub edx, 48                       ; subtract '0'
    cmp edx, 9                        ; compare to '9'
    lea eax, DWORD PTR [rdx-39]       ; add ('0' - 'a' + 10)
    cmovbe  eax, edx                  ; choose between two cases in branchless way
    
    ;hextoint_cond2;                  ; (modified slightly)
    mov eax, 48                       
    mov edx, 87                       ; set two offsets to registers
    cmp ecx, 57                       ; compare with '9'
    cmovbe  edx, eax                  ; choose one offset
    sub ecx, edx                      ; subtract the offset
    
    ;hextoint_bit;
    mov ecx, 57                       ; load '9'
    sub ecx, eax                      ; get '9' - x
    sar ecx, 31                       ; convert to mask if negative
    and ecx, 39                       ; set to 39 (for x > '9')
    sub eax, ecx                      ; subtract 39 or 0
    sub eax, 48                       ; subtract '0'
    

    Analysis

    I'll try to estimate the number of cycles taken by each approach in the throughput sense, which is essentially the time spent per input value when many values are processed at once. Consider the Sandy Bridge architecture as an example.

    The hextoint_lut function consists of a single memory load, which takes 1 uop on port 2 or 3. Both of these ports are dedicated to memory loads, and they include address-generation logic that can compute rax+rcx at no additional cost. There are two such ports, and each can execute one uop per cycle, so this version would supposedly take 0.5 cycles per value. If we have to load the input value from memory, that requires one more memory load per value, so the total cost would be 1 cycle.

    The hextoint_cond version has 4 instructions, but cmov is broken into two separate uops. So there are 5 uops in total, each of which can be processed on any of the three arithmetic ports 0, 1, and 5. So supposedly it would take 5/3 cycles per value. Note that the memory load ports are free, so the time would not increase even if you have to load the input value from memory.

    The hextoint_cond2 version has 5 instructions. But in a tight loop the constants can be preloaded into registers, so only the comparison, cmov, and subtraction remain. That is 4 uops in total, giving 4/3 cycles per value (even with a memory read).

    The hextoint_bit version is guaranteed to have no branches and no lookups, which is handy if you do not want to keep checking whether your compiler generated a cmov instruction. The first mov is free, since the constant can be preloaded in a tight loop. The rest are 5 arithmetic instructions, which are 5 uops on ports 0, 1, and 5. So it should take 5/3 cycles per value (even with a memory read).
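
    One caveat about hextoint_bit: it builds the mask by right-shifting a possibly negative int, and the result of that shift is implementation-defined before C++20 (mainstream compilers do an arithmetic shift). A minimal sketch of the same mask trick using only unsigned arithmetic is below. The name hextoint_bit_unsigned is made up, and I have not checked that compilers emit the same instruction sequence for it.

```cpp
#include <cstdint>

// Hypothetical variant of hextoint_bit that avoids shifting a negative signed value.
// Assumes x is a valid lowercase hex digit ('0'-'9' or 'a'-'f').
int hextoint_bit_unsigned(char x) {
    std::uint32_t b = static_cast<std::uint8_t>(x);
    // '9' - b wraps around to a value with the top bit set exactly when b > '9'.
    std::uint32_t sign = (std::uint32_t('9') - b) >> 31;  // 1 if b > '9', else 0
    std::uint32_t mask = 0u - sign;                       // all ones if b > '9', else 0
    std::uint32_t offset = '0' + (mask & std::uint32_t('a' - '0' - 10));
    return static_cast<int>(b - offset);
}
```

    The unsigned shift yields 0 or 1, and subtracting that from zero expands it into the same all-zeros or all-ones mask that sar produces in the listing above.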

    Benchmark

    I have performed a benchmark of the C++ functions described above. In the benchmark, 64 KB of random data is generated, then each function is run many times on this data. All the results are added to a checksum to ensure that the compiler does not remove the code. Manual 8x unrolling is used. I have tested on an Ivy Bridge 3.4 GHz core, which is very similar to Sandy Bridge. Each line of output contains: name of function, total time taken by the benchmark, number of cycles per input value, sum of all outputs.
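
    To make that setup concrete, here is a rough sketch of such a harness. It is not the benchmark code used for the numbers below: make_hex_input, BenchResult and run_benchmark are hypothetical names, the seed and the pass count are arbitrary, and the checksum is kept unsigned so that wraparound is well defined.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One of the functions under test, repeated so the sketch compiles on its own.
int hextoint_cond(char x) {
    std::uint32_t dig = x - '0';
    std::uint32_t alp = dig + ('0' - 'a' + 10);
    return dig <= 9U ? dig : alp;
}

// Hypothetical helper: 'size' random lowercase hex digits from a fixed seed.
std::vector<char> make_hex_input(std::size_t size) {
    static const char digits[] = "0123456789abcdef";
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> pick(0, 15);
    std::vector<char> data(size);
    for (char& c : data) c = digits[pick(rng)];
    return data;
}

struct BenchResult {
    double seconds;          // wall-clock time for all passes
    std::uint32_t checksum;  // sum of all outputs, keeps the work observable
};

// Hypothetical harness: runs f over the data 'passes' times with manual 8x unrolling.
// Any tail shorter than 8 bytes is skipped, so use a size divisible by 8.
template <class Func>
BenchResult run_benchmark(Func f, const std::vector<char>& data, int passes) {
    std::uint32_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i + 8 <= data.size(); i += 8) {
            sum += f(data[i + 0]);
            sum += f(data[i + 1]);
            sum += f(data[i + 2]);
            sum += f(data[i + 3]);
            sum += f(data[i + 4]);
            sum += f(data[i + 5]);
            sum += f(data[i + 6]);
            sum += f(data[i + 7]);
        }
    }
    auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), sum};
}
```

    Calling run_benchmark(hextoint_cond, make_hex_input(64 * 1024), passes) gives the time and the checksum. Cycles per value would then be seconds times the clock frequency, divided by the number of values processed (passes times the data size). Passing the function as a pointer may prevent inlining on some compilers, so a serious measurement would check the generated code first.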

    Benchmark code

    MSVC2013 x64 /O2:
    hextoint_lut: 0.741 sec, 1.2 cycles  (check: -1022918656)
    hextoint_cond: 1.925 sec, 3.0 cycles  (check: -1022918656)
    hextoint_cond2: 1.660 sec, 2.6 cycles  (check: -1022918656)
    hextoint_bit: 1.400 sec, 2.2 cycles  (check: -1022918656)
    
    GCC 4.8.3 x64 -O3 -fno-tree-vectorize:
    hextoint_lut: 0.702 sec, 1.1 cycles  (check: -1114112000)
    hextoint_cond: 1.513 sec, 2.4 cycles  (check: -1114112000)
    hextoint_cond2: 2.543 sec, 4.0 cycles  (check: -1114112000)
    hextoint_bit: 1.544 sec, 2.4 cycles  (check: -1114112000)
    
    GCC 4.8.3 x64 -O3:
    hextoint_lut: 0.702 sec, 1.1 cycles  (check: -1114112000)
    hextoint_cond: 0.717 sec, 1.1 cycles  (check: -1114112000)
    hextoint_cond2: 0.468 sec, 0.7 cycles  (check: -1114112000)
    hextoint_bit: 0.577 sec, 0.9 cycles  (check: -1114112000)
    

    Clearly, the LUT approach takes about one cycle per value (as predicted for the case where the input is loaded from memory). The other approaches normally take from 2.2 to 2.6 cycles per value. With GCC, hextoint_cond2 is slow because the compiler uses cmp+sbb+and magic instead of the desired cmov instructions. Also note that by default GCC vectorizes most of the approaches (last block of results), which gives expectedly faster results than the unvectorizable LUT approach. Note that manual vectorization would give a significantly greater boost.

    Discussion

    Note that hextoint_cond with an ordinary conditional jump instead of cmov would have a branch. Assuming random input hex digits, it will be mispredicted almost always. So performance would be terrible, I think.

    I have analysed throughput performance. But if we have to process tons of input values, then we should definitely vectorize the conversion to get better speed. hextoint_cond can be vectorized with SSE in a pretty straightforward way. That allows converting 16 input bytes into 16 output bytes using only 4 instructions, taking about 2 cycles I suppose.
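
    A possible shape for such a vectorized loop, using SSE2 intrinsics, is sketched below. It uses compare + and + add + sub, which is closer to the mask trick of hextoint_bit than to a literal translation of hextoint_cond, so treat it as my assumption about what the 4 instructions could be. The name hextoint_block and the HEX_SKETCH_HAVE_SSE2 macro are made up, input is assumed to be valid lowercase hex digits, and a scalar loop handles the tail and builds without SSE2.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HEX_SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: converts n lowercase hex digits in src to values 0..15 in dst.
void hextoint_block(const char* src, std::uint8_t* dst, std::size_t n) {
    std::size_t i = 0;
#ifdef HEX_SKETCH_HAVE_SSE2
    const __m128i nine = _mm_set1_epi8('9');
    const __m128i gap = _mm_set1_epi8('a' - '0' - 10);  // 39
    const __m128i zero_char = _mm_set1_epi8('0');
    for (; i + 16 <= n; i += 16) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Signed byte compare is fine here because ASCII codes are below 128.
        __m128i is_alpha = _mm_cmpgt_epi8(x, nine);  // 0xFF where x > '9'
        __m128i offset = _mm_add_epi8(zero_char, _mm_and_si128(is_alpha, gap));
        __m128i value = _mm_sub_epi8(x, offset);     // x - 48 or x - 87
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), value);
    }
#endif
    for (; i < n; ++i) {  // scalar tail, or the whole input without SSE2
        std::uint8_t c = static_cast<std::uint8_t>(src[i]);
        dst[i] = static_cast<std::uint8_t>(c - (c <= '9' ? '0' : 'a' - 10));
    }
}
```

    The loads and stores go to the memory ports, so only the four arithmetic instructions compete for the vector ALU ports. I have not measured this version.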

    Note that in order to see any performance difference, you must ensure that all the input values fit into cache (L1 is the best case). If you read the input data from main memory, even std::atoi is as fast as the methods considered here =)

    Also, you should unroll your main loop 4x or even 8x for maximum performance (to remove looping overhead). As you might have already noticed, the speed of each method depends heavily on the operations surrounding the code. E.g. adding a memory load doubles the time taken by the first approach, but does not affect the other approaches.

    P.S. Most likely you don't really need to optimize this.
