I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value.
Interesting question, I've found a way to do this for a binary(3) that might work as well for a binary(32):
-- One-row test table with a 3-byte binary column
drop table if exists BinaryTest;
create table BinaryTest (hash binary(3));
insert into BinaryTest values (0xAAAAAA);
set @supplied = cast(0x888888 as binary);
-- XOR each byte pair, render the results in base 2, then count the '1' characters
select length(replace(concat(
    bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
    bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
    bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
),'0',''))
from BinaryTest;
The replace removes all zeroes, and the length of the remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)
This prints 6, which matches the number of ones in:
0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010
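As a quick cross-check of that arithmetic, the built-in BIN() and BIT_COUNT() functions can be applied to the same two literals directly. This is only a sanity check on the expected value, assuming the hex literals are evaluated as integers in this numeric context; the column aliases are made up for illustration:

```sql
-- XOR the two sample values as integers, then show the bit pattern and the popcount.
select bin(0xAAAAAA ^ 0x888888)       as xor_bits,  -- '1000100010001000100010'
       bit_count(0xAAAAAA ^ 0x888888) as ones;      -- 6
```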
It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only way I have found to get decent performance is to split the content of the BINARY column into multiple BIGINT columns, each containing an 8-byte substring of the original data.
In my case (32 bytes) this would mean using 4 BIGINT columns and this function:
CREATE FUNCTION HAMMINGDISTANCE(
    A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
    B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN
    BIT_COUNT(A0 ^ B0) +
    BIT_COUNT(A1 ^ B1) +
    BIT_COUNT(A2 ^ B2) +
    BIT_COUNT(A3 ^ B3);
In my testing, this approach is over 100 times faster than the BINARY approach.
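For illustration, here is a minimal sketch of how the four-column function might be called. The table HashChunks, its columns h0..h3, and the @b0..@b3 variables are hypothetical names, not part of my actual schema, and the values are deliberately tiny so the result can be checked by hand. It assumes the HAMMINGDISTANCE function defined above already exists:

```sql
-- Hypothetical table: each 32-byte hash is stored as four 8-byte chunks.
CREATE TABLE HashChunks (
    id INT PRIMARY KEY,
    h0 BIGINT, h1 BIGINT, h2 BIGINT, h3 BIGINT
);
INSERT INTO HashChunks VALUES (1, 0, 0, 0, 0), (2, 1, 3, 7, 15);

-- Chunks of the supplied hash (all zero here, for readability).
SET @b0 = 0, @b1 = 0, @b2 = 0, @b3 = 0;

-- Rank rows by distance to the supplied hash.
SELECT id, HAMMINGDISTANCE(h0, h1, h2, h3, @b0, @b1, @b2, @b3) AS distance
FROM HashChunks
ORDER BY distance;
-- id 1 -> 0, id 2 -> 10 (1 + 2 + 3 + 4 set bits)
```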
FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):
CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN
    BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 1, 8)), 16, 10) ^
        CONV(HEX(SUBSTRING(B, 1, 8)), 16, 10)
    ) +
    BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 9, 8)), 16, 10) ^
        CONV(HEX(SUBSTRING(B, 9, 8)), 16, 10)
    ) +
    BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^
        CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
    ) +
    BIT_COUNT(
        CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^
        CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
    );
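One possible way around the binary > hex > decimal conversions, offered as a sketch rather than something I have benchmarked: as far as I know, MySQL 8.0 and later let the bit operators and BIT_COUNT() work directly on binary strings, provided both operands have the same length. Assuming that version, the small binary(3) setup from the earlier answer could be queried like this (the alias distance is made up for illustration):

```sql
-- Assumes MySQL 8.0+, where ^ and BIT_COUNT() accept binary strings of equal length.
drop table if exists BinaryTest;
create table BinaryTest (hash binary(3));
insert into BinaryTest values (0xAAAAAA);
set @supplied = cast(0x888888 as binary);

-- XOR the whole column value against the supplied value, then count the set bits.
select bit_count(hash ^ @supplied) as distance
from BinaryTest;
```

If this holds on your server version, the same expression should carry over unchanged to a BINARY(32) column, since nothing in it depends on the width. On older versions the bit operators work on 64-bit integers only, so the chunked approaches above still apply.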