optimized itoa function

I am thinking about how to implement the conversion of an integer (4-byte, unsigned) to a string with SSE instructions. The usual routine is to divide the number and store it in a loc

Vectorization

We could execute the above algorithm as-is on the SSE units, but there is almost no gain in performance. However, if we split the value into smaller chunks, we can take advantage of SSE4.1 32-bit multiply instructions. I tried three different splits:

2 groups of 5 digits
3 groups of 4 digits
4 groups of 3 digits

The fastest variant was 4 groups of 3 digits. See below for the results.
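
As a rough scalar illustration of the fastest split (not the SSE code that was benchmarked), the value can be cut into one leading digit plus three 3-digit groups. The names `put3` and `utoa_1333` are hypothetical; in a vectorized version the three groups would be handled in parallel lanes instead of by three calls.

```c
#include <stdint.h>

/* Hypothetical helper: write the three decimal digits of g (0..999). */
static void put3(char *out, uint32_t g)
{
    out[0] = (char)('0' + g / 100);
    out[1] = (char)('0' + g / 10 % 10);
    out[2] = (char)('0' + g % 10);
}

/* Writes exactly 10 zero-padded digits plus a terminator (1/3/3/3 split). */
static void utoa_1333(char out[11], uint32_t val)
{
    uint32_t const hi = val / 1000000;   /* top 4 digits: 0..4294 */
    uint32_t const lo = val % 1000000;   /* low 6 digits */

    out[0] = (char)('0' + hi / 1000);    /* the single leading digit */
    put3(out + 1, hi % 1000);
    put3(out + 4, lo / 1000);
    put3(out + 7, lo % 1000);
    out[10] = '\0';
}
```

This sketch keeps the leading zeros; stripping them is a separate step.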

Performance

I tested many variants of Terje's algorithm in addition to the algorithms suggested by vitaut and Inge Henriksen. I verified through exhaustive testing of inputs that each algorithm's output matches itoa().

My numbers are taken from a Westmere E5640 running Windows 7 64-bit. I benchmark at real-time priority and locked to core 0. I execute each algorithm 4 times to force everything into the cache. I time 2^24 calls using RDTSCP to remove the effect of any dynamic clock speed changes.

I timed 5 different patterns of inputs:

itoa(0 .. 9) -- nearly best-case performance
itoa(1000 .. 1999) -- longer output, no branch mispredicts
itoa(100000000 .. 999999999) -- longest output, no branch mispredicts
itoa(256 random values) -- varying output length
itoa(65536 random values) -- varying output length and thrashes L1/L2 caches

The data:

ALG        TINY     MEDIUM   LARGE    RND256   RND64K   NOTES
NULL         7 clk    7 clk    7 clk    7 clk    7 clk  Benchmark overhead baseline
TERJE_C     63 clk   62 clk   63 clk   57 clk   56 clk  Best C implementation of Terje's algorithm
TERJE_ASM   48 clk   48 clk   50 clk   45 clk   44 clk  Naive, hand-written AMD64 version of Terje's algorithm
TERJE_SSE   41 clk   42 clk   41 clk   34 clk   35 clk  SSE intrinsic version of Terje's algorithm with 1/3/3/3 digit grouping
INGE_0      12 clk   31 clk   71 clk   72 clk   72 clk  Inge's first algorithm
INGE_1      20 clk   23 clk   45 clk   69 clk   96 clk  Inge's second algorithm
INGE_2      18 clk   19 clk   32 clk   29 clk   36 clk  Improved version of Inge's second algorithm
VITAUT_0     9 clk   16 clk   32 clk   35 clk   35 clk  vitaut's algorithm
VITAUT_1    11 clk   15 clk   33 clk   31 clk   30 clk  Improved version of vitaut's algorithm
LIBC        46 clk  128 clk  329 clk  339 clk  340 clk  MSVCRT12 implementation

My compiler (VS 2013 Update 4) produced surprisingly bad code; the assembly version of Terje's algorithm is just a naive translation, and it's a full 21% faster. I was also surprised at the performance of the SSE implementation, which I expected to be slower. The big surprise was how fast INGE_2, VITAUT_0, and VITAUT_1 were. Bravo to vitaut for coming up with a portable solution that bests even my best effort at the assembly level.

Note: INGE_1 is a modified version of Inge Henriksen's second algorithm because the original has a bug.

INGE_2 is based on the second algorithm that Inge Henriksen gave. Rather than storing pointers to the precalculated strings in a char*[] array, it stores the strings themselves in a char[][5] array. The other big improvement is in how it stores characters in the output buffer. It stores more characters than necessary and uses pointer arithmetic to return a pointer to the first non-zero character. The result is substantially faster -- competitive even with the SSE-optimized version of Terje's algorithm. It should be noted that the microbenchmark favors this algorithm a bit because in real-world applications the 600K data set will constantly blow the caches.

VITAUT_1 is based on vitaut's algorithm with two small changes. First, it copies character pairs in the main loop, reducing the number of store instructions. Second, similar to INGE_2, it always copies both final characters and uses pointer arithmetic to return a pointer to the start of the string.

Implementation

Here I give code for the 3 most interesting algorithms.

TERJE_ASM:

; char *itoa_terje_asm(char *buf<rcx>, uint32_t val<edx>)
;
; *** NOTE ***
; buf *must* be 8-byte aligned or this code will break!
itoa_terje_asm:
    MOV     EAX, 0xA7C5AC47
    ADD     RDX, 1
    IMUL    RAX, RDX
    SHR     RAX, 48          ; EAX = val / 100000

    IMUL    R11D, EAX, 100000
    ADD     EAX, 1
    SUB     EDX, R11D        ; EDX = (val % 100000) + 1

    IMUL    RAX, 214748      ; RAX = (val / 100000) * 2^31 / 10000
    IMUL    RDX, 214748      ; RDX = (val % 100000) * 2^31 / 10000

    ; Extract buf[0] & buf[5]
    MOV     R8, RAX
    MOV     R9, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R8, 31           ; R8 = buf[0]
    SHR     R9, 31           ; R9 = buf[5]

    ; Extract buf[1] & buf[6]
    MOV     R10, RAX
    MOV     R11, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R10, 31 - 8
    SHR     R11, 31 - 8
    AND     R10D, 0x0000FF00 ; R10 = buf[1] << 8
    AND     R11D, 0x0000FF00 ; R11 = buf[6] << 8
    OR      R10D, R8D        ; R10 = buf[0] | (buf[1] << 8)
    OR      R11D, R9D        ; R11 = buf[5] | (buf[6] << 8)

    ; Extract buf[2] & buf[7]
    MOV     R8, RAX
    MOV     R9, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R8, 31 - 16
    SHR     R9, 31 - 16
    AND     R8D, 0x00FF0000  ; R8 = buf[2] << 16
    AND     R9D, 0x00FF0000  ; R9 = buf[7] << 16
    OR      R8D, R10D        ; R8 = buf[0] | (buf[1] << 8) | (buf[2] << 16)
    OR      R9D, R11D        ; R9 = buf[5] | (buf[6] << 8) | (buf[7] << 16)

    ; Extract buf[3], buf[4], buf[8], & buf[9]
    MOV     R10, RAX
    MOV     R11, RDX
    LEA     EAX, [RAX+RAX]   ; RAX = (RAX * 2) & 0xFFFFFFFF
    LEA     EDX, [RDX+RDX]   ; RDX = (RDX * 2) & 0xFFFFFFFF
    LEA     RAX, [RAX+RAX*4] ; RAX *= 5
    LEA     RDX, [RDX+RDX*4] ; RDX *= 5
    SHR     R10, 31 - 24
    SHR     R11, 31 - 24
    AND     R10D, 0xFF000000 ; R10 = buf[3] << 24
    AND     R11D, 0xFF000000 ; R11 = buf[8] << 24
    AND     RAX, 0x80000000  ; RAX = buf[4] << 31
    AND     RDX, 0x80000000  ; RDX = buf[9] << 31
    OR      R10D, R8D        ; R10 = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)
    OR      R11D, R9D        ; R11 = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24)
    LEA     RAX, [R10+RAX*2] ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32)
    LEA     RDX, [R11+RDX*2] ; RDX = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24) | (buf[9] << 32)

    ; Compact the character strings
    SHL     RAX, 24          ; RAX = (buf[0] << 24) | (buf[1] << 32) | (buf[2] << 40) | (buf[3] << 48) | (buf[4] << 56)
    MOV     R8, 0x3030303030303030
    SHRD    RAX, RDX, 24     ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32) | (buf[5] << 40) | (buf[6] << 48) | (buf[7] << 56)
    SHR     RDX, 24          ; RDX = buf[8] | (buf[9] << 8)

    ; Store 12 characters. The last 2 will be null bytes.
    OR      R8, RAX
    LEA     R9, [RDX+0x3030]
    MOV     [RCX], R8
    MOV     [RCX+8], R9D

    ; Convert RCX into a bit pointer.
    SHL     RCX, 3

    ; Scan the first 8 bytes for a non-zero character.
    OR      EDX, 0x00000100
    TEST    RAX, RAX
    LEA     R10, [RCX+64]
    CMOVZ   RAX, RDX
    CMOVZ   RCX, R10

    ; Scan the next 4 bytes for a non-zero character.
    TEST    EAX, EAX
    LEA     R10, [RCX+32]
    CMOVZ   RCX, R10
    SHR     RAX, CL          ; N.B. RAX >>= (RCX % 64); this works because buf is 8-byte aligned.

    ; Scan the next 2 bytes for a non-zero character.
    TEST    AX, AX
    LEA     R10, [RCX+16]
    CMOVZ   RCX, R10
    SHR     EAX, CL          ; N.B. RAX >>= (RCX % 32)

    ; Convert back to byte pointer. N.B. this works because the AMD64 virtual address space is 48-bit.
    SAR     RCX, 3

    ; Scan the last byte for a non-zero character.
    TEST    AL, AL
    MOV     RAX, RCX
    LEA     R10, [RCX+1]
    CMOVZ   RAX, R10

    RETN
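
The digit extraction at the heart of the assembly above can be sketched in C as follows. This only mirrors the arithmetic (split at 100000, scale by 214748, then repeatedly take bit 31 and up as a digit and multiply the remainder by 10). It does not reproduce the byte packing or the leading-zero scan, so the output is always 10 zero-padded digits. The names `put5` and `utoa_terje` are made up for this sketch.

```c
#include <stdint.h>

/* Emits the five decimal digits of n (0..99999).
   x holds (n + 1) * 214748, i.e. roughly n / 10000 as a fixed-point
   number with 31 fractional bits, so the integer part is the next digit. */
static void put5(char *out, uint32_t n)
{
    uint64_t x = (uint64_t)(n + 1) * 214748;

    for (int i = 0; i < 5; i++) {
        out[i] = (char)('0' + (x >> 31));   /* integer part = next digit */
        x = (x & 0x7FFFFFFFu) * 10;         /* keep the fraction, shift one decimal place */
    }
}

/* Writes exactly 10 zero-padded digits plus a terminator. */
static void utoa_terje(char out[11], uint32_t val)
{
    uint32_t const hi = val / 100000;       /* compilers turn this into a multiply */

    put5(out, hi);
    put5(out + 5, val - hi * 100000);
    out[10] = '\0';
}
```

The `+ 1` compensates for 214748 being slightly less than 2^31 / 10000, which is why the assembly adds 1 to both halves before scaling.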

INGE_2:

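#include <stdint.h>
#include <stdlib.h>  // itoa() (non-standard; provided by MSVC)
#include <string.h>

uint8_t len100K[100000];    // digit count of each value in 0..99999
char str100K[100000][5];    // each value as 5 zero-padded characters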

void itoa_inge_2_init()
{
    memset(str100K, '0', sizeof(str100K));

    for(uint32_t i = 0; i < 100000; i++)
    {
        char buf[6];
        itoa(i, buf, 10);
        len100K[i] = strlen(buf);
        memcpy(&str100K[i][5 - len100K[i]], buf, len100K[i]);
    }
}

char *itoa_inge_2(char *buf, uint32_t val)
{
    char *p = &buf[10];
    uint32_t prevlen;

    *p = '\0';

    do
    {
        uint32_t const old = val;
        uint32_t mod;

        val /= 100000;
        mod = old - (val * 100000);

        prevlen = len100K[mod];
        p -= 5;
        memcpy(p, str100K[mod], 5);
    }
    while(val != 0);

    return &p[5 - prevlen];
}

VITAUT_1:

static uint16_t const str100p[100] = {
    0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930,
    0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931,
    0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932,
    0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933,
    0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934,
    0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935,
    0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936,
    0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937,
    0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
    0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, };

char *itoa_vitaut_1(char *buf, uint32_t val)
{
    char *p = &buf[10];

    *p = '\0';

    while(val >= 100)
    {
        uint32_t const old = val;

        p -= 2;
        val /= 100;
        memcpy(p, &str100p[old - (val * 100)], sizeof(uint16_t));
    }

    p -= 2;
    memcpy(p, &str100p[val], sizeof(uint16_t));

    return &p[val < 10];
}
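
Note that the `uint16_t` table above bakes in little-endian byte order (0x3130 is stored as the bytes '0', '1'). A variant that should behave the same on any byte order keeps the digit pairs in a plain character string instead. This is a sketch, and the names `digit_pairs` and `itoa_pairs` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* "00" "01" ... "99" laid out back to back; pair i starts at offset 2*i. */
static const char digit_pairs[201] =
    "00010203040506070809"
    "10111213141516171819"
    "20212223242526272829"
    "30313233343536373839"
    "40414243444546474849"
    "50515253545556575859"
    "60616263646566676869"
    "70717273747576777879"
    "80818283848586878889"
    "90919293949596979899";

/* buf must hold at least 11 bytes; returns a pointer into buf. */
static char *itoa_pairs(char *buf, uint32_t val)
{
    char *p = &buf[10];

    *p = '\0';

    while (val >= 100) {
        uint32_t const old = val;

        p -= 2;
        val /= 100;
        memcpy(p, &digit_pairs[2 * (old - val * 100)], 2);
    }

    /* Always copy two characters, then skip the leading '0' if val < 10. */
    p -= 2;
    memcpy(p, &digit_pairs[2 * val], 2);

    return &p[val < 10];
}
```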

忘掉有多难

2021-02-04 07:01

Interesting problem. If you only need a radix-10 itoa(), I have written two examples: one is about 3 times as fast as the typical itoa() implementation, and the other about 10 times as fast.

First example (3x performance)

The first, which is 3 times as fast as itoa(), uses a single-pass non-reversal software design pattern and is based on the open source itoa() implementation found in groff.

// itoaSpeedTest.cpp : Defines the entry point for the console application.
//

#pragma comment(lib, "Winmm.lib") 
#include "stdafx.h"
#include "Windows.h"

#include <iostream>
#include <time.h>

using namespace std;

#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is
#endif

/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647

/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647

/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11

#ifdef _WIN32

/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);

/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT 

/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();

/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif

/** Array used for fast number character lookup */
const char numbersIn10Radix[10] = {'0','1','2','3','4','5','6','7','8','9'};

/** Array used for fast reverse number character lookup */
const char reverseNumbersIn10Radix[10] = {'9','8','7','6','5','4','3','2','1','0'};
const char *reverseArrayEndPtr = &reverseNumbersIn10Radix[9];

/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm and is 3x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc@jclark.com>, 1989-1992
\author Inge Eivind Henriksen<inge@meronymy.com>, 2013
\note Function was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i) 
{   
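    // Make room for a 32-bit signed integer's digits, the '-' and the '\0'.
    // static, so the returned pointer stays valid after the function returns
    // (at the cost of thread safety).
    static char buf[_INT32_MAX_LENGTH + 2];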
    char *p = buf + _INT32_MAX_LENGTH + 1;

    *--p = '\0';

    if (i >= 0) 
    {
        do 
        {
            *--p = numbersIn10Radix[i % 10];
            i /= 10;
        } while (i);
    }
    else
    {
        // Negative integer
        do
        {
            *--p = reverseArrayEndPtr[i % 10];
            i /= 10;
        } while (i);

        *--p = '-';
    }

    return p;
}

int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT

    // Make sure we are playing fair here
    if (sizeof(int) != sizeof(_INT32))
    {
        cerr << "Error: integer size mismatch; test would be invalid." << endl;
        return -1;
    }

    const int steps = 100;
    {
        char intBuffer[20];
        cout << "itoa() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            itoa(i, intBuffer, 10);

        TIMER_STOP;
    }
    {
        cout << "Int32ToStr() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            Int32ToStr(i);

        TIMER_STOP;
    }

    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}

On 64-bit Windows the result from running this example is:

itoa() took:
2909.84 ms.
Int32ToStr() took:
991.726 ms.
Done

On 32-bit Windows the result from running this example is:

itoa() took:
3119.6 ms.
Int32ToStr() took:
1031.61 ms.
Done

Second example (10x performance)

If you don't mind spending some time initializing some buffers then it's possible to optimize the function above to be 10x faster than the typical itoa() implementation. What you need to do is to create string buffers rather than character buffers, like this:

// itoaSpeedTest.cpp : Defines the entry point for the console application.
//

#pragma comment(lib, "Winmm.lib") 
#include "stdafx.h"
#include "Windows.h"

#include <iostream>
#include <time.h>

using namespace std;

#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32

/** a signed 8-bit integer value type */
#define _INT8 __int8

/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is

/** a signed 8-bit integer value type */
#define _INT8 char

/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#endif

/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647

/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647

/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11

#ifdef _WIN32

/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock (nothing to do on non-Windows platforms; timeBeginPeriod() is a Windows-only API) */
#define TIMER_INIT

/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif


 /* Set this as large or small as you want, but has to be in the form 10^n where n >= 1, setting it smaller will
 make the buffers smaller but the performance slower. If you want to set it larger than 100000 then you 
must add some more cases to the switch blocks. Try to make it smaller to see the difference in 
performance. It does however seem to become slower if larger than 100000 */
static const _INT32 numElem10Radix = 100000;

/** Arrays used for fast number string lookup (string and its length) */
const char *numbersIn10Radix[numElem10Radix] = {};
_UINT8 numbersIn10RadixLen[numElem10Radix] = {};

/** Arrays used for fast reverse number string lookup (used for negative values) */
const char *reverseNumbersIn10Radix[numElem10Radix] = {};
_UINT8 reverseNumbersIn10RadixLen[numElem10Radix] = {};

void InitBuffers()
{
    char intBuffer[20];

    for (int i = 0; i < numElem10Radix; i++)
    {
        itoa(i, intBuffer, 10);
        size_t numLen = strlen(intBuffer);
        char *intStr = new char[numLen + 1];
        strcpy(intStr, intBuffer);
        numbersIn10Radix[i] = intStr;
        numbersIn10RadixLen[i] = numLen;
        reverseNumbersIn10Radix[numElem10Radix - 1 - i] = intStr;
        reverseNumbersIn10RadixLen[numElem10Radix - 1 - i] = numLen;
    }
}

/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm with string buffers and is 10x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc@jclark.com>, 1989-1992
\author Inge Eivind Henriksen, 2013
\note This file was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i) 
{   
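    /* Room for INT_DIGITS digits, - and '\0'.
       static, so the returned pointer stays valid after the function returns
       (at the cost of thread safety). */
    static char buf[_INT32_MAX_LENGTH + 2];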
    char *p = buf + _INT32_MAX_LENGTH + 1;
    _INT32 modVal;

    *--p = '\0';

    if (i >= 0) 
    {
        do 
        {
            modVal = i % numElem10Radix;

            switch(numbersIn10RadixLen[modVal])
            {
                case 5:
                    *--p = numbersIn10Radix[modVal][4];
                case 4:
                    *--p = numbersIn10Radix[modVal][3];
                case 3:
                    *--p = numbersIn10Radix[modVal][2];
                case 2:
                    *--p = numbersIn10Radix[modVal][1];
                default:
                    *--p = numbersIn10Radix[modVal][0];
            }

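            i /= numElem10Radix;

            // Every chunk except the most significant one must be zero-padded
            // to 5 digits; otherwise 100005 would come out as "15".
            if (i)
                for (int pad = 5 - numbersIn10RadixLen[modVal]; pad > 0; --pad)
                    *--p = '0';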
        } while (i);
    }
    else
    {
        // Negative integer
        const char **reverseArray = &reverseNumbersIn10Radix[numElem10Radix - 1];
        const _UINT8 *reverseArrayLen = &reverseNumbersIn10RadixLen[numElem10Radix - 1];

        do
        {
            modVal = i % numElem10Radix;

            switch(reverseArrayLen[modVal])
            {
                case 5:
                    *--p = reverseArray[modVal][4];
                case 4:
                    *--p = reverseArray[modVal][3];
                case 3:
                    *--p = reverseArray[modVal][2];
                case 2:
                    *--p = reverseArray[modVal][1];
                default:
                    *--p = reverseArray[modVal][0];
            }

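            i /= numElem10Radix;

            // Every chunk except the most significant one must be zero-padded
            // to 5 digits; otherwise -100005 would come out as "-15".
            if (i)
                for (int pad = 5 - reverseArrayLen[modVal]; pad > 0; --pad)
                    *--p = '0';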
        } while (i);

        *--p = '-';
    }

    return p;
}

int _tmain(int argc, _TCHAR* argv[])
{
    InitBuffers();

    TIMER_INIT

    // Make sure we are playing fair here
    if (sizeof(int) != sizeof(_INT32))
    {
        cerr << "Error: integer size mismatch; test would be invalid." << endl;
        return -1;
    }

    const int steps = 100;
    {
        char intBuffer[20];
        cout << "itoa() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            itoa(i, intBuffer, 10);

        TIMER_STOP;
    }
    {
        cout << "Int32ToStr() took:" << endl;
        TIMER_START;

        for (int i = _INT32_MIN; i < i + steps ; i += steps)
            Int32ToStr(i);

        TIMER_STOP;
    }

    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}

On 64-bit Windows the result from running this example is:

itoa() took:
2914.12 ms.
Int32ToStr() took:
306.637 ms.
Done

On 32-bit Windows the result from running this example is:

itoa() took:
3126.12 ms.
Int32ToStr() took:
299.387 ms.
Done

Why do you use reverse string lookup buffers?

It's possible to do this without the reverse string lookup buffers (thus saving half the internal memory), but this makes it significantly slower (timed at about 850 ms on 64-bit and 380 ms on 32-bit systems). It's not clear to me exactly why it's so much slower, especially on 64-bit systems. To test this further yourself, simply change the following code:

#define _UINT32 unsigned _INT32
...
static const _UINT32 numElem10Radix = 100000;
...
void InitBuffers()
{
    char intBuffer[20];

    for (int i = 0; i < numElem10Radix; i++)
    {
        _itoa(i, intBuffer, 10);
        size_t numLen = strlen(intBuffer);
        char *intStr = new char[numLen + 1];
        strcpy(intStr, intBuffer);
        numbersIn10Radix[i] = intStr;
        numbersIn10RadixLen[i] = numLen;
    }
}
...
const char *Int32ToStr(_INT32 i) 
{   
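    // static, so the returned pointer stays valid after the function returns
    static char buf[_INT32_MAX_LENGTH + 2];
    char *p = buf + _INT32_MAX_LENGTH + 1;
    _UINT32 modVal;

    *--p = '\0';

    // Work on the magnitude of i; a plain cast would turn -5 into 4294967291
    _UINT32 j = (i < 0) ? 0u - (_UINT32)i : (_UINT32)i;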

    do 
    {
        modVal = j % numElem10Radix;

        switch(numbersIn10RadixLen[modVal])
        {
            case 5:
                *--p = numbersIn10Radix[modVal][4];
            case 4:
                *--p = numbersIn10Radix[modVal][3];
            case 3:
                *--p = numbersIn10Radix[modVal][2];
            case 2:
                *--p = numbersIn10Radix[modVal][1];
            default:
                *--p = numbersIn10Radix[modVal][0];
        }

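        j /= numElem10Radix;

        // Every chunk except the most significant one must be zero-padded
        // to 5 digits; otherwise 100005 would come out as "15".
        if (j)
            for (int pad = 5 - numbersIn10RadixLen[modVal]; pad > 0; --pad)
                *--p = '0';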
    } while (j);

    if (i < 0) *--p = '-';

    return p;
}

终归单人心

2021-02-04 07:06
The first step to optimizing your code is getting rid of the arbitrary base support. Division by a constant is almost surely compiled to a multiplication, whereas division by a variable base is a real division instruction. Also, '0'+n is faster than "0123456789abcdef"[n] (no memory access is involved in the former).
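
A minimal sketch of what that looks like for base 10 only (the name `utoa10` is made up here): the divisor is a compile-time constant and each digit is formed by adding to '0' rather than by a table lookup.

```c
#include <stdint.h>

/* buf must hold at least 11 bytes; returns a pointer to the first digit. */
static char *utoa10(char *buf, uint32_t val)
{
    char *p = buf + 10;

    *p = '\0';

    do {
        *--p = (char)('0' + val % 10);  /* constant divisor: no div instruction needed */
        val /= 10;
    } while (val != 0);

    return p;
}
```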

If you need to go beyond that, you could make lookup tables for each byte in the base you care about (e.g. 10), then vector-add the (e.g. decimal) results for each byte. As in:
```
00 02 00 80 (input)

 0000000000 (place3[0x00])
+0000131072 (place2[0x02])
+0000000000 (place1[0x00])
+0000000128 (place0[0x80])
 ==========
 0000131200 (result)
```
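
One way to read the idea above as scalar C, assuming one table per input byte that stores the decimal digits of that byte's contribution. The per-digit sums are added column by column with carry propagation; a vectorized version would add the four rows at once and fix up carries afterwards. The names `place`, `place_init` and `itoa_bytes` are hypothetical.

```c
#include <stdint.h>

/* place[k][b] = the 10 decimal digits of (b << 8k), most significant first. */
static uint8_t place[4][256][10];

static void place_init(void)
{
    for (int k = 0; k < 4; k++)
        for (int b = 0; b < 256; b++) {
            uint32_t v = (uint32_t)b << (8 * k);
            for (int d = 9; d >= 0; d--) {
                place[k][b][d] = (uint8_t)(v % 10);
                v /= 10;
            }
        }
}

/* Writes exactly 10 zero-padded digits plus a terminator. */
static void itoa_bytes(char out[11], uint32_t val)
{
    unsigned carry = 0;

    for (int d = 9; d >= 0; d--) {
        /* Column sum is at most 4 * 9 + 3 = 39. */
        unsigned const sum = carry
            + place[0][val & 0xFF][d]
            + place[1][(val >> 8) & 0xFF][d]
            + place[2][(val >> 16) & 0xFF][d]
            + place[3][val >> 24][d];

        out[d] = (char)('0' + sum % 10);
        carry = sum / 10;
    }
    out[10] = '\0';
}
```

With the input from the example, `itoa_bytes(out, 0x00020080)` after `place_init()` yields "0000131200".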

故里飘歌

2021-02-04 07:06

http://sourceforge.net/projects/itoa/

It uses a big static const array of all 4-digit integers and uses it for 32-bit or 64-bit conversion to string.

Portable, no need of a specific instruction set.

The only faster version I could find was in assembly code and limited to 32 bits.
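
To illustrate the general idea of a 4-digit table (this is not the linked project's code, and every name below is hypothetical), a sketch for 32-bit values might look like this. The table here is filled at start-up rather than being a static const initializer.

```c
#include <stdint.h>
#include <string.h>

/* quad[i] = the value i (0..9999) as 4 zero-padded characters. */
static char quad[10000][4];

static void quad_init(void)
{
    for (uint32_t i = 0; i < 10000; i++) {
        quad[i][0] = (char)('0' + i / 1000);
        quad[i][1] = (char)('0' + i / 100 % 10);
        quad[i][2] = (char)('0' + i / 10 % 10);
        quad[i][3] = (char)('0' + i % 10);
    }
}

/* buf must hold at least 13 bytes (3 chunks of 4 + terminator). */
static char *utoa_quad(char *buf, uint32_t val)
{
    char *p = buf + 12;
    uint32_t last;

    *p = '\0';

    do {
        last = val % 10000;
        val /= 10000;
        p -= 4;
        memcpy(p, quad[last], 4);
    } while (val != 0);

    /* Skip up to 3 leading zeros of the most significant chunk. */
    p += (last < 1000) + (last < 100) + (last < 10);
    return p;
}
```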

[愿得一人]

2021-02-04 07:07

This post compares several methods of integer to string conversion aka itoa. The fastest method reported there is fmt::FormatInt from the fmt library which is about 8 times faster than sprintf/std::stringstream and 5 times faster than the naive ltoa/itoa implementation (the actual numbers may of course vary depending on platform).

Unlike most other methods fmt::FormatInt does one pass over the digits. It also minimizes the number of integer divisions using the idea from Alexandrescu's talk Three Optimization Tips for C++. The implementation is available here.

This is of course if C++ is an option and you are not restricted by the itoa's API.

Disclaimer: I'm the author of this method and the fmt library.

小鲜肉

2021-02-04 07:12

This is part of my code in asm. It only works for the range 0-255. It could be made faster, but it shows the direction and the main idea.

It uses 4 imuls, 1 memory read, and 1 memory write.

You can try to eliminate 2 of the imuls by using leas with shifts. However, you won't find anything faster in C/C++/Python ;)

void itoa_asm(unsigned char inVal, char *str)
{
    __asm
    {
        // eax=100's      -> (some_integer/100) = (some_integer*41) >> 12
        movzx esi,inVal
        mov eax,esi
        mov ecx,41
        imul eax,ecx
        shr eax,12

        mov edx,eax
        imul edx,100
        mov edi,edx

        // ebx=10's       -> (some_integer/10) = (some_integer*205) >> 11
        mov ebx,esi
        sub ebx,edx
        mov ecx,205
        imul ebx,ecx
        shr ebx,11

        mov edx,ebx
        imul edx,10

        // ecx=1's        -> some_integer - 100's - 10's
        mov ecx,esi
        sub ecx,edx    // -> sub 10's
        sub ecx,edi    // -> sub 100's

        add al,'0'
        add bl,'0'
        add cl,'0'
        //shl eax,
        shl ebx,8
        shl ecx,16
        or eax,ebx
        or eax,ecx

        mov edi,str
        mov [edi],eax

    }

}
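
For readers who prefer C, the same multiply-and-shift arithmetic can be sketched as below. The constants 41/12 and 205/11 are the ones from the assembly; the function name `u8_to_dec3` is hypothetical. Like the assembly, it always emits three characters, including leading zeros.

```c
#include <stdint.h>

/* Writes v (0..255) as three zero-padded digits plus a terminator. */
static void u8_to_dec3(char out[4], uint8_t v)
{
    uint32_t const h = ((uint32_t)v * 41) >> 12;   /* v / 100, exact for 0..255 */
    uint32_t const r = v - h * 100;                /* remainder 0..99 */
    uint32_t const t = (r * 205) >> 11;            /* r / 10, exact for 0..99 */

    out[0] = (char)('0' + h);
    out[1] = (char)('0' + t);
    out[2] = (char)('0' + (r - t * 10));
    out[3] = '\0';
}
```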
