The hunt for the fastest Hamming Distance C implementation [duplicate]

为君一笑 posted on 2019-12-04 11:05:05

You can make your comparison cover more bytes at a time by applying a bitwise operation to chunks of the native integer size.

In your code, you're testing equality one byte at a time, but your CPU can compare at least a word in a single operation, and 8 bytes if it's x86-64. The exact performance characteristics depend on the CPU architecture, of course.

But if you advance through the two pointers with a stride of 8 bytes, it could well be faster in some scenarios. When the strings have to be read from main memory, the memory load time will dominate the performance. But if the strings are in the CPU cache, you might be able to do an XOR and interpret the result by testing which bytes of the 64-bit value contain changed bits.

Counting the bytes ("buckets") of the XOR result that aren't 0 can be done with a variant of the SWAR bit-counting algorithm, starting from 0x33333333 instead of 0x55555555.

The algorithm will be harder to work with, because it will require using uint64_t pointers that have proper memory alignment. You'll need a preamble and postscript that covers the leftover bytes. Maybe you should read the assembly the compiler outputs and see if it's not doing something more clever before you try something more complicated in code.
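As a rough illustration of the word-at-a-time idea, here is a minimal sketch, not a tuned implementation. The function names `nonzero_bytes` and `count_mismatches_words` are hypothetical. Two things differ from the description above and are assumptions of this sketch: it loads each 8-byte chunk with memcpy, which avoids the alignment requirement (so only a postscript loop for the leftover bytes is needed), and it counts nonzero bytes by OR-folding each byte into its lowest bit and summing the flags with a multiply, rather than with the 0x33333333 variant mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of nonzero bytes in x (0..8). */
static unsigned nonzero_bytes(uint64_t x)
{
    /* OR-fold each byte's eight bits down into its lowest bit. */
    x |= x >> 4;
    x |= x >> 2;
    x |= x >> 1;
    x &= UINT64_C(0x0101010101010101);   /* keep one 0/1 flag per byte */
    /* The multiply adds the eight flags together in the top byte. */
    return (unsigned)((x * UINT64_C(0x0101010101010101)) >> 56);
}

/* Hypothetical helper: counts positions i < n where a[i] != b[i]. */
size_t count_mismatches_words(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    size_t mismatches = 0;

    assert(n == 0 || (a != NULL && b != NULL));

    for (; n >= 8; pa += 8, pb += 8, n -= 8) {
        uint64_t wa, wb;
        memcpy(&wa, pa, 8);              /* alignment-safe 8-byte load */
        memcpy(&wb, pb, 8);
        mismatches += nonzero_bytes(wa ^ wb);
    }
    for (; n > 0; pa++, pb++, n--)       /* leftover bytes */
        mismatches += (*pa != *pb);
    return mismatches;
}
```

Compilers usually turn a fixed-size memcpy like this into a single load, but that is worth confirming in the generated assembly, along with whether any of this actually beats the plain byte loop for your string lengths.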

Instead of

if (*a != *b)
    ++num_mismatches;

this would be faster on some architectures (with 8 bit bytes) because it avoids the branch:

int bits = *a ^ *b;
bits |= bits >> 4;
bits |= bits >> 2;
bits |= bits >> 1;
num_mismatches += bits & 1; 

How about loop unrolling:

while (na >= 8){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  num_mismatches += (a[2] != b[2]);
  num_mismatches += (a[3] != b[3]);
  num_mismatches += (a[4] != b[4]);
  num_mismatches += (a[5] != b[5]);
  num_mismatches += (a[6] != b[6]);
  num_mismatches += (a[7] != b[7]);
  a += 8; b += 8; na -= 8;
}
if (na >= 4){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  num_mismatches += (a[2] != b[2]);
  num_mismatches += (a[3] != b[3]);
  a += 4; b += 4; na -= 4;
}
if (na >= 2){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  a += 2; b += 2; na -= 2;
}
if (na >= 1){
  num_mismatches += (a[0] != b[0]);
  a += 1; b += 1; na -= 1;
}

Also, if you know there are long stretches of equal characters, you could cast the pointers to long* and compare the characters 4 at a time (assuming sizeof(long) is 4), and only look at the individual characters if the longs are not equal. The code below relies on memset and memcpy being fast. It copies the strings into long arrays to 1) eliminate alignment issues, and 2) pad the strings with zeros out to a whole number of longs. As it compares each pair of longs, if they are not equal, it casts the pointers to char* and counts up the unequal characters. The main loop could also be unrolled, similar to the above.

long la[BIG_ENOUGH];
long lb[BIG_ENOUGH];
memset(la, 0, sizeof(la));
memset(lb, 0, sizeof(lb));
memcpy(la, a, na);
memcpy(lb, b, nb);
int nla = (na + 3) / 4; // number of longs to compare, assuming sizeof(long) = 4
long *pa = la, *pb = lb;
while(nla >= 1){
  if (pa[0] != pb[0]){
    // cast the pointers (not the loaded long values) to look at single bytes
    num_mismatches += (((char*)pa)[0] != ((char*)pb)[0])
                    + (((char*)pa)[1] != ((char*)pb)[1])
                    + (((char*)pa)[2] != ((char*)pb)[2])
                    + (((char*)pa)[3] != ((char*)pb)[3])
                    ;
  }
  pa += 1;pb += 1; nla -= 1;
}

If the strings are padded with zeros to always be 32 bytes and their addresses are 16-byte aligned, you could do something like this (code neither tested nor profiled):

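movdqa xmm0, [a]
movdqa xmm1, [a + 16]
pcmpeqb xmm0, [b]        ; each byte: 0xFF if equal, 0x00 if different
pcmpeqb xmm1, [b + 16]
paddb xmm0, xmm1         ; each byte: 0, -1 or -2
pxor xmm2, xmm2
psubb xmm2, xmm0         ; each byte: 0, 1 or 2 equal positions
pxor xmm1, xmm1
psadbw xmm2, xmm1        ; sum each 8-byte half into words 0 and 4
pextrw eax, xmm2, 0
pextrw edx, xmm2, 4
add eax, edx             ; eax = number of equal bytes (0..32)
neg eax
add eax, 32              ; eax = number of mismatches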

But if the strings are usually tiny, it does a lot of unnecessary work and it may not be any faster. It should be faster if the strings are usually (nearly) 32 bytes though.


edit: I wrote this answer before I saw your updated comment - if the strings are usually that tiny, this is probably not very good. A 16-byte version could (maybe) be useful though (run the second iteration conditionally, the branch for that should be well-predicted because it'll be rarely taken). But with such short strings, the normal code is hard to beat.

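movdqa xmm0, [a]
pcmpeqb xmm0, [b]        ; each byte: 0xFF if equal, 0x00 if different
pxor xmm1, xmm1
psubb xmm1, xmm0         ; each byte: 1 if equal, 0 if different
pxor xmm0, xmm0
psadbw xmm1, xmm0        ; sum each 8-byte half into words 0 and 4
pextrw eax, xmm1, 0
pextrw edx, xmm1, 4
add eax, edx             ; eax = number of equal bytes (0..16)
neg eax
add eax, 16              ; eax = number of mismatches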
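
For anyone who would rather stay in C, the same compare-and-sum idea can be sketched with SSE2 intrinsics. This is a sketch under assumptions, not a benchmarked implementation: the names `mismatches16` and `mismatches_padded32` are hypothetical, the target is assumed to be x86 or x86-64 with SSE2, both buffers are assumed to be zero-padded to exactly 32 bytes, and it uses unaligned loads (`_mm_loadu_si128`) instead of movdqa so the 16-byte alignment requirement goes away.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; x86 / x86-64 only */
#include <stddef.h>

/* Mismatching bytes among the 16 bytes at a and b (no alignment needed). */
static unsigned mismatches16(const unsigned char *a, const unsigned char *b)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);       /* 0xFF where bytes are equal */
    __m128i ones = _mm_sub_epi8(zero, eq);     /* 1 where bytes are equal */
    __m128i sums = _mm_sad_epu8(ones, zero);   /* one sum per 8-byte half */
    unsigned equal = (unsigned)_mm_cvtsi128_si32(sums)
                   + (unsigned)_mm_extract_epi16(sums, 4);
    return 16u - equal;
}

/* Hypothetical helper: both buffers must be zero-padded to exactly 32 bytes. */
unsigned mismatches_padded32(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return mismatches16(a, b) + mismatches16(a + 16, b + 16);
}
```

Because the padding bytes are zero in both buffers, they always compare equal and do not contribute to the count. As with the assembly, this does a fixed amount of work regardless of the real string length, so it is only likely to pay off when the strings are usually close to 32 bytes.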