Hamming distance on binary strings in SQL

三世轮回 提交于 2019-11-27 03:41:25

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

Interesting question, I've found a way to do this for a binary(3) that might work as well for a binary(32):

drop table if exists BinaryTest;
create table  BinaryTest (hash binary(3));
insert BinaryTest values (0xAAAAAA);

set @supplied = cast(0x888888 as binary);

select  length(replace(concat(
            bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
            bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
            bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
        ),'0',''))
from    BinaryTest;

The replace removes any all zeroes, and the length of remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)

This prints 6, which matches the number of ones in

0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!