Convert 0x1234 to 0x11223344

我在风中等你 2021-01-30 13:07

How do I expand the hexadecimal number 0x1234 to 0x11223344 in a high-performance way?

unsigned int c = 0x1234, b;
b = (c & 0xff) << 4 | c & 0xf |          


        
13 Answers
  •  一生所求
    2021-01-30 13:17

    I'm not sure what the most efficient way would be, but this is a little shorter:

    #include <stdio.h>
    
    int main()
    {
      unsigned x = 0x1234;
    
      x = (x << 8) | x;
      x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
      x = (x << 4) | x;
    
      printf("0x1234 -> 0x%08x\n",x);
    
      return 0;
    }
    

    If you need to do this repeatedly and very quickly, as suggested in your edit, you could consider generating a lookup table and using that instead. The following function dynamically allocates and initializes such a table:

    #include <stdlib.h>
    
    unsigned *makeLookupTable(void)
    {
      unsigned *tbl = malloc(sizeof(unsigned) * 65536);
      if (!tbl) return NULL;
      int i;
      for (i = 0; i < 65536; i++) {
        unsigned x = i;
        x |= (x << 8);
        x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
        x |= (x << 4);
    
        /* Uncomment next line to invert the high byte as mentioned in the edit. */
        /* x = x ^ 0xff000000; */
    
        tbl[i] = x;
      }
      return tbl;
    }
    

    After that each conversion is just something like:

    result = lookuptable[input];
    

    ...or maybe, to keep the index within the table:

    result = lookuptable[input & 0xffff];
    

    Or a smaller, more cache-friendly lookup table (or pair) could be used with one lookup each for the high and low bytes (as noted by @LưuVĩnhPhúc in the comments). In that case, table generation code might be:

    unsigned *makeLookupTableLow(void)
    {
      unsigned *tbl = malloc(sizeof(unsigned) * 256);
      if (!tbl) return NULL;
      int i;
      for (i = 0; i < 256; i++) {
        unsigned x = i;
        x = ((x & 0xf0) << 4) | (x & 0x0f);
        x |= (x << 4);
        tbl[i] = x;
      }
      return tbl;
    }
    

    ...and an optional second table:

    unsigned *makeLookupTableHigh(void)
    {
      unsigned *tbl = malloc(sizeof(unsigned) * 256);
      if (!tbl) return NULL;
      int i;
      for (i = 0; i < 256; i++) {
        unsigned x = i;
        x = ((x & 0xf0) << 20) | ((x & 0x0f) << 16);
        x |= (x << 4);
    
        /* uncomment next line to invert high byte */
        /* x = x ^ 0xff000000; */
    
        tbl[i] = x;
      }
      return tbl;
    }
    

    ...and to convert a value with two tables:

    result = hightable[input >> 8] | lowtable[input & 0xff];
    

    ...or with one (just the low table above):

    result = (lowtable[input >> 8] << 16) | lowtable[input & 0xff];
    result ^= 0xff000000; /* to invert high byte */
    

    If the upper part of the value (alpha?) doesn't change much, even the single large table might perform well since consecutive lookups would be closer together in the table.


    I took the performance test code that @Apriori posted, made some adjustments, and added tests for the other answers that he hadn't included originally. I then compiled three versions of it with different settings: one 64-bit build with SSE4.1 enabled, where the compiler can make use of SSE for its own optimizations, and two 32-bit builds, one with SSE and one without. Although all three were run on the same fairly recent processor, the results show how the optimal solution can change depending on the processor features:

                               64b SSE4.1  32b SSE4.1  32b no SSE
    -------------------------- ----------  ----------  ----------
    ExpandOrig           time:  3.502 s     3.501 s     6.260 s
    ExpandLookupSmall    time:  3.530 s     3.997 s     3.996 s
    ExpandLookupLarge    time:  3.434 s     3.419 s     3.427 s
    ExpandIsalamon       time:  3.654 s     3.673 s     8.870 s
    ExpandIsalamonOpt    time:  3.784 s     3.720 s     8.719 s
    ExpandChronoKitsune  time:  3.658 s     3.463 s     6.546 s
    ExpandEvgenyKluev    time:  6.790 s     7.697 s    13.383 s
    ExpandIammilind      time:  3.485 s     3.498 s     6.436 s
    ExpandDmitri         time:  3.457 s     3.477 s     5.461 s
    ExpandNitish712      time:  3.574 s     3.800 s     6.789 s
    ExpandAdamLiss       time:  3.673 s     5.680 s     6.969 s
    ExpandAShelly        time:  3.524 s     4.295 s     5.867 s
    ExpandAShellyMulOp   time:  3.527 s     4.295 s     5.852 s
    ExpandSSE4           time:  3.428 s
    ExpandSSE4Unroll     time:  3.333 s
    ExpandSSE2           time:  3.392 s
    ExpandSSE2Unroll     time:  3.318 s
    ExpandAShellySSE4    time:  3.392 s
    

    The executables were compiled on 64-bit Linux with gcc 4.8.1, using "-m64 -O3 -march=core2 -msse4.1", "-m32 -O3 -march=core2 -msse4.1" and "-m32 -O3 -march=core2 -mno-sse" respectively. @Apriori's SSE tests were omitted from the 32-bit builds (they crashed on 32-bit with SSE enabled, and obviously won't work with SSE disabled).

    One of the adjustments was to use actual image data (photos of objects with transparent backgrounds) instead of random values, which greatly improved the performance of the large lookup table but made little difference for the others.

    Essentially, the lookup tables win by a landslide when SSE is unavailable (or unused), and the manually coded SSE solutions win otherwise. However, it's also noteworthy that when the compiler could use SSE for optimizations, most of the bit manipulation solutions were almost as fast as the manually coded SSE -- still slower, but only marginally.
