What's a time efficient algorithm to copy unaligned bit arrays?

醉梦人生 2021-02-09 19:57

I've had to do this many times in the past, and I've never been satisfied with the results.

Can anyone suggest a fast way of copying a contiguous bit array fro

4 Answers
  •  星月不相逢
    2021-02-09 20:30

    What is optimal depends on the target platform. On some platforms without barrel shifters, shifting the whole vector right or left by one bit, n times, for n<3, will be the fastest approach (on the PIC18 platform, an 8x-unrolled byte loop that shifts left by one bit costs 11 instruction cycles per eight bytes). Otherwise, I like the following pattern (note that src2 has to be initialized depending on what you want done with the end of your buffer):

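      src1 = *src++;                                           /* load the next source word */
      src2 = (src1 << shiftamount1) | (src2 >> shiftamount2);  /* merge with the previous word */
      *dest++ = src2;
      src2 = *src++;                                           /* same step with the roles swapped */
      src1 = (src2 << shiftamount1) | (src1 >> shiftamount2);
      *dest++ = src1;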
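
    For illustration, here is a minimal C sketch of a complete copy routine built on the same shift-and-merge idea. It is not taken from the answer itself: the function name `copy_bits_unaligned`, the 32-bit word size, the LSB-first bit numbering, and the word-aligned destination are all assumptions made for this example.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: copy nwords * 32 bits, starting at bit `bit_off`
     * (0..31) of src[0], into the word-aligned array dst.
     * Assumes LSB-first bit numbering. When bit_off != 0 the source must have
     * nwords + 1 readable words. */
    void copy_bits_unaligned(uint32_t *dst, const uint32_t *src,
                             size_t nwords, unsigned bit_off)
    {
        if (bit_off == 0) {
            /* Aligned case: a shift by 32 would be undefined in C. */
            memcpy(dst, src, nwords * sizeof *dst);
            return;
        }
        unsigned down = bit_off;       /* moves the previous word's high bits down */
        unsigned up   = 32 - bit_off;  /* moves the next word's low bits up */
        uint32_t prev = *src++;        /* word carried between iterations */
        while (nwords--) {
            uint32_t next = *src++;
            *dst++ = (prev >> down) | (next << up);
            prev = next;               /* the two-word unrolled pattern avoids this move */
        }
    }
    ```

    The hand-unrolled listing alternates the roles of src1 and src2 so that no register-to-register move is needed; a compiler may or may not perform the same transformation on the loop in this sketch.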
    

    The two-word pattern above should lend itself to a very efficient implementation on an ARM (eight instructions every two words, if registers are available for src, dest, src1, src2, shiftamount1, and shiftamount2). Using more registers would allow faster operation via multi-word load/store instructions. Handling four words would be something like the following (one machine instruction per line, except that the first four lines would together be one instruction, as would the last four lines):

      src0 = *src++;
      src1 = *src++;
      src2 = *src++;
      src3 = *src++;
      tmp  = src0;
      src0 = src0 >> shiftamount1;
      src0 = src0 | (src1 << shiftamount2);
      src1 = src1 >> shiftamount1;
      src1 = src1 | (src2 << shiftamount2);
      src2 = src2 >> shiftamount1;
      src2 = src2 | (src3 << shiftamount2);
      src3 = src3 >> shiftamount1;
      src3 = src3 | (tmp << shiftamount2);
      *dest++ = src0;
      *dest++ = src1;
      *dest++ = src2;
      *dest++ = src3;
    

    Eleven instructions per 16 bytes rotated.
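
    As a rough check of the four-word listing, the same steps can be written as a small C function. This is only a sketch under assumptions not stated in the answer: the name `rotate128_right` is hypothetical, words are 32 bits with word 0 least significant, and the shift is restricted to 1..31 because shifting a 32-bit value by 32 is undefined in C.

    ```c
    #include <stdint.h>

    /* Hypothetical helper: rotate the 128-bit value in src[0..3] right by
     * `shift` bits and store it in dst[0..3]. Assumes 1 <= shift <= 31 and
     * that src[0] holds the least significant 32 bits. */
    void rotate128_right(uint32_t dst[4], const uint32_t src[4], unsigned shift)
    {
        unsigned shiftamount1 = shift;       /* right-shift count, as in the listing */
        unsigned shiftamount2 = 32 - shift;  /* left-shift count, as in the listing */

        uint32_t src0 = src[0], src1 = src[1], src2 = src[2], src3 = src[3];
        uint32_t tmp = src0;                 /* first word wraps around into the last */

        src0 = (src0 >> shiftamount1) | (src1 << shiftamount2);
        src1 = (src1 >> shiftamount1) | (src2 << shiftamount2);
        src2 = (src2 >> shiftamount1) | (src3 << shiftamount2);
        src3 = (src3 >> shiftamount1) | (tmp  << shiftamount2);

        dst[0] = src0;
        dst[1] = src1;
        dst[2] = src2;
        dst[3] = src3;
    }
    ```

    Because the original first word is fed back into the last one, this rotates a 16-byte block rather than copying a stream. For a streaming copy, the wrap-around word would instead be the next word read from the source.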
