Murmur3 hash gives different results between Python and Java implementations

北荒 2021-01-14 14:04

I have two different programs that need to hash the same string using Murmur3, one in Python and one in Java.

Python version 2.7.9:



        
2 Answers
  • 2021-01-14 14:45

    If anyone is interested in the reverse direction, converting the Python output to the Java output:

    import mmh3

    HEX_DIGITS = '0123456789abcdef'
    # hash_bytes returns the 128-bit hash as 16 raw bytes
    murmur = mmh3.hash_bytes('abc')

    # Emit two hex digits per byte (high nibble first), keeping the byte order
    result = [f'{HEX_DIGITS[(byte >> 4) & 0xf]}{HEX_DIGITS[byte & 0xf]}' for byte in murmur]
    print(''.join(result))
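
On Python 3.5 or later, the same per-byte hex formatting is also available through the built-in `bytes.hex()`. The following is a minimal sketch, assuming the digest would come from `mmh3.hash_bytes`; the helper names `digest_to_hex` and `digest_to_hex_manual` are hypothetical, and the sample bytes are placeholders rather than a real hash value:

```python
def digest_to_hex(digest: bytes) -> str:
    """Format each byte as two lowercase hex digits, in byte order."""
    return digest.hex()


def digest_to_hex_manual(digest: bytes) -> str:
    """Same result, spelled out nibble by nibble as in the answer above."""
    hex_digits = '0123456789abcdef'
    return ''.join(hex_digits[b >> 4] + hex_digits[b & 0xF] for b in digest)


# Placeholder bytes for illustration only; in practice pass mmh3.hash_bytes('abc').
sample = bytes([0x00, 0x1F, 0xA0, 0xFF])
print(digest_to_hex(sample))         # 001fa0ff
print(digest_to_hex_manual(sample))  # 001fa0ff
```

Both helpers keep the bytes in the order they are given, so no endianness conversion happens at this step.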
    
  • 2021-01-14 14:46

    Here's how to get the same result from both:

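    // Guava returns the 128-bit hash as 16 bytes in little-endian order.
    byte[] mm3_le = Hashing.murmur3_128().hashString("abc", UTF_8).asBytes();
    // BigInteger expects big-endian input, so reverse the byte order first.
    byte[] mm3_be = Bytes.toArray(Lists.reverse(Bytes.asList(mm3_le)));
    // Signum 1 keeps the value unsigned even when the top bit is set.
    assertEquals("79267961763742113019008347020647561319",
        new BigInteger(1, mm3_be).toString());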
    

    The hash code's bytes need to be treated as little endian, but BigInteger interprets a byte array as big endian. You were presumably creating the BigInteger with new BigInteger(hex, 16), but the output of HashCode.toString() is a series of two-digit hexadecimal pairs representing the hash bytes in the same order that asBytes() returns them (little endian). (You can also reverse the order of those pairs to get a hex number that produces the same result when passed to new BigInteger(reversedHex, 16).)
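
The byte-order point can be checked without any hashing library. Here is a minimal Python sketch using made-up bytes (not a real Murmur3 digest); the helper names `le_bytes_to_int` and `reverse_hex_pairs` are hypothetical:

```python
def le_bytes_to_int(digest: bytes) -> int:
    """Interpret the digest as an unsigned little-endian integer."""
    return int.from_bytes(digest, byteorder='little')


def reverse_hex_pairs(hex_str: str) -> str:
    """Reverse the order of the two-digit byte pairs in a hex string."""
    pairs = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]
    return ''.join(reversed(pairs))


# Placeholder bytes for illustration only.
sample = bytes([0x01, 0x02, 0x03, 0x04])
hex_as_printed = sample.hex()    # '01020304', bytes in their original order

print(int(hex_as_printed, 16))   # 16909060, the big-endian reading
print(le_bytes_to_int(sample))   # 67305985, the little-endian reading
print(int(reverse_hex_pairs(hex_as_printed), 16))  # 67305985 again
```

Parsing the hex string directly gives the big-endian reading, while reversing the pairs first reproduces the little-endian value.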

    I think the documentation of toString() is somewhat confusing because of the way it refers to "big endian"; it doesn't actually mean that the output of the method is the hexadecimal number representing the bytes interpreted as big endian.

    We have an open issue for adding asBigInteger() to HashCode.
