Hamming distance on binary strings in SQL

傲寒 2020-11-30 01:41

I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied

2 answers
  • 2020-11-30 01:54

    Interesting question. I've found a way to do this for a binary(3) that might work for a binary(32) as well:

    drop table if exists BinaryTest;
    create table  BinaryTest (hash binary(3));
    insert BinaryTest values (0xAAAAAA);
    
    set @supplied = cast(0x888888 as binary);
    
    select  length(replace(concat(
                bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
                bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
                bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
            ),'0',''))
    from    BinaryTest;
    

    The replace removes all zeroes, and the length of the remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)

    This prints 6, which matches the number of ones in

    0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010
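
    As a possible simplification, the bin/replace/length step could be swapped for MySQL's built-in BIT_COUNT(), which counts the set bits of an integer directly. This is only a sketch that repeats the same setup as above so it can be run on its own; the alias `distance` is just an illustrative name:

    ```sql
    -- Same test data as in the original query
    drop table if exists BinaryTest;
    create table  BinaryTest (hash binary(3));
    insert BinaryTest values (0xAAAAAA);

    set @supplied = cast(0x888888 as binary);

    -- XOR each byte pair, then count the set bits of each result
    select  bit_count(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))) +
            bit_count(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))) +
            bit_count(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
            as distance
    from    BinaryTest;
    ```

    For the sample row, each byte pair gives 0xAA ^ 0x88 = 0x22, which has two set bits, so the three bytes again add up to 6.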
    
  • 2020-11-30 02:03

    It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only way to get decent performance is to split the content of the BINARY column into multiple BIGINT columns, each containing an 8-byte substring of the original data.

    In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

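    -- Parameters are UNSIGNED so that 8-byte chunks with the high bit set stay in range
    CREATE FUNCTION HAMMINGDISTANCE(
      A0 BIGINT UNSIGNED, A1 BIGINT UNSIGNED, A2 BIGINT UNSIGNED, A3 BIGINT UNSIGNED,
      B0 BIGINT UNSIGNED, B1 BIGINT UNSIGNED, B2 BIGINT UNSIGNED, B3 BIGINT UNSIGNED
    )
    RETURNS INT DETERMINISTIC
    RETURN
      -- XOR leaves a 1 wherever the two chunks differ; BIT_COUNT counts those bits
      BIT_COUNT(A0 ^ B0) +
      BIT_COUNT(A1 ^ B1) +
      BIT_COUNT(A2 ^ B2) +
      BIT_COUNT(A3 ^ B3);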
    

    In my testing, this approach is over 100 times faster than the BINARY approach.
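
    To make the split concrete, here is a rough sketch of how the four columns might be populated and queried. The table name HashChunks, the column names h0..h3, the variable names and the sample inputs are all made up for illustration. The BIT_COUNT sum is written inline so the snippet runs without any stored function:

    ```sql
    -- Hypothetical schema: table and column names are illustrative only
    CREATE TABLE HashChunks (
      hash BINARY(32) NOT NULL,        -- original value, kept for reference
      h0 BIGINT UNSIGNED NOT NULL,     -- bytes 1-8
      h1 BIGINT UNSIGNED NOT NULL,     -- bytes 9-16
      h2 BIGINT UNSIGNED NOT NULL,     -- bytes 17-24
      h3 BIGINT UNSIGNED NOT NULL      -- bytes 25-32
    );

    -- Split one sample hash into four 8-byte unsigned integers on insert
    SET @h = UNHEX(SHA2('example', 256));
    INSERT INTO HashChunks (hash, h0, h1, h2, h3) VALUES (
      @h,
      CAST(CONV(HEX(SUBSTRING(@h,  1, 8)), 16, 10) AS UNSIGNED),
      CAST(CONV(HEX(SUBSTRING(@h,  9, 8)), 16, 10) AS UNSIGNED),
      CAST(CONV(HEX(SUBSTRING(@h, 17, 8)), 16, 10) AS UNSIGNED),
      CAST(CONV(HEX(SUBSTRING(@h, 25, 8)), 16, 10) AS UNSIGNED)
    );

    -- Split the supplied value the same way, once, before scanning the table
    SET @s  = UNHEX(SHA2('supplied', 256));
    SET @s0 = CAST(CONV(HEX(SUBSTRING(@s,  1, 8)), 16, 10) AS UNSIGNED);
    SET @s1 = CAST(CONV(HEX(SUBSTRING(@s,  9, 8)), 16, 10) AS UNSIGNED);
    SET @s2 = CAST(CONV(HEX(SUBSTRING(@s, 17, 8)), 16, 10) AS UNSIGNED);
    SET @s3 = CAST(CONV(HEX(SUBSTRING(@s, 25, 8)), 16, 10) AS UNSIGNED);

    -- Per-row work is now four integer XORs and four bit counts
    SELECT hash,
           BIT_COUNT(h0 ^ @s0) + BIT_COUNT(h1 ^ @s1) +
           BIT_COUNT(h2 ^ @s2) + BIT_COUNT(h3 ^ @s3) AS distance
    FROM   HashChunks
    ORDER  BY distance;
    ```

    The hex and decimal conversions still happen, but only once per inserted row and once for the supplied value, rather than for every row on every query.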


    FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

    CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
    RETURNS INT DETERMINISTIC
    RETURN 
      BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
        CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
      ) +
      BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
        CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
      ) +
      BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
        CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
      ) +
      BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
        CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
      );
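
    On the request for something without the binary > hex > decimal round trip: as far as I know, MySQL 8.0 and later evaluate the bit operators and BIT_COUNT() directly on binary-string arguments, provided both operands have the same length. Assuming such a server (on 5.7 or earlier the operands would be converted to 64-bit integers and the result would not be correct), a sketch could look like this. The name HAMMINGDISTANCE_BIN is hypothetical:

    ```sql
    -- Assumes MySQL 8.0+, where ^ and BIT_COUNT() accept binary strings.
    -- HAMMINGDISTANCE_BIN is a made-up name, not an existing function.
    CREATE FUNCTION HAMMINGDISTANCE_BIN(A BINARY(32), B BINARY(32))
    RETURNS INT DETERMINISTIC
    RETURN BIT_COUNT(A ^ B);

    -- Example call with two arbitrary SHA-256 values
    SELECT HAMMINGDISTANCE_BIN(
      UNHEX(SHA2('foo', 256)),
      UNHEX(SHA2('bar', 256))
    ) AS distance;
    ```

    Whether this is as fast as the split BIGINT columns is something I have not measured, so it would need to be benchmarked against real data.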
    