How do I calculate a good hash code for a list of strings?

我寻月下人不归 2020-12-01 02:34

Background:

  • I have a short list of strings.
  • The number of strings is not always the same, but is nearly always on the order of a “handful”.
  • I
11 Answers
  • 2020-12-01 03:30

    Another way that comes to mind: chain XORs of the hashes, rotating each one by an amount based on its index:

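    int shift = 0;
    int result = 1;
    for (String s : strings)
    {
        // Rotate each hash left by its index (mod 32) so that order affects the result.
        // Integer.rotateLeft avoids the precedence and sign-extension pitfalls of hand-written shifts.
        result ^= Integer.rotateLeft(s.hashCode(), shift);
        shift = (shift + 1) % 32;
    }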
    

    Edit: after reading the explanation given in Effective Java, I think geoff's code would be much more efficient.
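    geoff's code is not reproduced in this thread, so the following is only a minimal sketch of the multiply-by-31 recipe commonly associated with Effective Java, and it may differ from what geoff posted. The class name `OrderedHash`, the method `combine`, and the seed value 17 are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; names and seed are illustrative assumptions.
class OrderedHash {
    // Multiply-and-add combination: the running result is scaled by 31
    // before each string's hash is added, so element order matters.
    static int combine(List<String> strings) {
        int result = 17; // arbitrary non-zero seed (assumption)
        for (String s : strings) {
            result = 31 * result + (s == null ? 0 : s.hashCode());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(combine(Arrays.asList("a", "b"))); // 19442
        System.out.println(combine(Arrays.asList("b", "a"))); // 19472
    }
}
```

    Unlike a plain XOR of the individual hashes, swapping two elements changes the result here.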

  • 2020-12-01 03:30

    A SQL-based solution could be based on the checksum and checksum_agg functions. If I'm following it right, you have something like:

    MyTable
      MyTableId
      HashCode
    
    MyChildTable
      MyTableId  (foreign key into MyTable)
      String
    

    with the various strings for a given item (MyTableId) stored in MyChildTable. To calculate and store a checksum reflecting these (never-to-be-changed) strings, something like this should work:

    UPDATE MyTable
     set HashCode = (select checksum_agg(checksum(ct.String))
                      from MyChildTable ct
                      where ct.MyTableId = MyTable.MyTableId)
     where MyTableId = @OnlyForThisOne
    

    I believe this is order-independent, so the strings "The", "quick", "brown" would produce the same checksum as "brown", "The", "quick".

  • 2020-12-01 03:35

    Using GetHashCode() is not ideal for combining multiple values. The problem is that, for strings, the hashcode is just a checksum, which leaves little entropy for similar values. E.g., adding the hashcodes for ("abc", "bbc") will give the same result as for ("abd", "abc"), causing a collision.

    In cases where you need to be absolutely sure, you'd use a real hash algorithm, like SHA1, MD5, etc. The only problem is that they are block functions, which makes it difficult to quickly compare hashes for equality. Instead, try a CRC or FNV1 hash. FNV1 32-bit is super simple:

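    public static class Fnv1 {
        // Standard 32-bit FNV parameters.
        public const uint OffsetBasis32 = 2166136261;
        public const uint FnvPrime32 = 16777619;
    
        public static int ComputeHash32(byte[] buffer) {
            // unchecked: the multiplication is meant to wrap, and the final cast must not throw.
            unchecked {
                uint hash = OffsetBasis32;
    
                // FNV-1: multiply by the prime first, then XOR in the next byte.
                foreach (byte b in buffer) {
                    hash *= FnvPrime32;
                    hash ^= b;
                }
    
                return (int)hash;
            }
        }
    }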
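
    The function above hashes a single byte buffer, so it still has to be applied to a list of strings somehow. One possible way, sketched here in Java, is to feed each string's UTF-8 bytes into the same running hash and add a separator byte after every string. The class `Fnv1Strings`, the method `hashStrings`, and the choice of 0xFF as separator (a byte that never occurs in UTF-8 output) are assumptions, not part of the answer above.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper; names and the separator choice are assumptions.
class Fnv1Strings {
    static final int OFFSET_BASIS_32 = 0x811c9dc5; // 2166136261
    static final int FNV_PRIME_32 = 0x01000193;    // 16777619

    // Plain FNV-1 over one buffer: multiply, then XOR. Java int arithmetic wraps.
    static int fnv1(byte[] buffer) {
        int hash = OFFSET_BASIS_32;
        for (byte b : buffer) {
            hash *= FNV_PRIME_32;
            hash ^= (b & 0xff);
        }
        return hash;
    }

    // FNV-1 over a list: each string's UTF-8 bytes, followed by a 0xFF separator
    // so that ("x", "yz") and ("xy", "z") feed different byte sequences.
    static int hashStrings(List<String> strings) {
        int hash = OFFSET_BASIS_32;
        for (String s : strings) {
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                hash = (hash * FNV_PRIME_32) ^ (b & 0xff);
            }
            hash = (hash * FNV_PRIME_32) ^ 0xff;
        }
        return hash;
    }
}
```

    As with any 32-bit hash, equal results still need to be confirmed by comparing the strings themselves.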
    
  • 2020-12-01 03:37

    I hope this is unnecessary, but since you don't mention anything which sounds like you're only using the hashcodes for a first check and then later verifying that the strings are actually equal, I feel the need to warn you:

    Hashcode equality != value equality

    Plenty of different sets of strings will yield an identical hashcode without being equal.
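
    To make the warning concrete, here is a small Java sketch. It relies on the fact that "Aa" and "BB" have the same `String.hashCode()` (both 2112) and uses `List.hashCode()` as the combined hash. The class `HashThenEquals` and the method `sameStrings` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing "hash as first check, equals as final word".
class HashThenEquals {
    static boolean sameStrings(List<String> a, List<String> b) {
        if (a.hashCode() != b.hashCode()) {
            return false; // different hashes: definitely not equal
        }
        return a.equals(b); // same hash: still has to be verified
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("Aa");
        List<String> second = Arrays.asList("BB");
        System.out.println(first.hashCode() == second.hashCode()); // true
        System.out.println(sameStrings(first, second));            // false
    }
}
```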

  • 2020-12-01 03:39

    The only drawback of your first option is that (String1, String2) produces the same hashcode as (String2, String1). If that's not a problem (e.g. because you have a fixed order), it's fine.

    "Cat all the strings together, then get the hashcode" seems the most natural and safest approach to me.

    Update: As a comment points out, this has the drawback that the lists ("x", "yz") and ("xy", "z") would give the same hash. To avoid this, you could join the strings with a delimiter that cannot appear inside the strings.
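
    A minimal Java sketch of the delimiter idea follows. It assumes the NUL character never occurs inside the strings; that assumption, the class `DelimitedHash`, and the method `hashJoined` are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the delimiter choice is an assumption about the data.
class DelimitedHash {
    // Assumption: '\u0000' never appears inside any of the strings.
    static final String DELIMITER = "\u0000";

    // Join the strings with the given delimiter and hash the result.
    static int hashJoined(List<String> strings, String delimiter) {
        return String.join(delimiter, strings).hashCode();
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "yz");
        List<String> b = Arrays.asList("xy", "z");
        // Without a delimiter both lists collapse to "xyz".
        System.out.println(hashJoined(a, "") == hashJoined(b, ""));               // true
        // With the delimiter the joined strings differ.
        System.out.println(hashJoined(a, DELIMITER) == hashJoined(b, DELIMITER)); // false
    }
}
```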

    If the strings are big, you might prefer to hash each one, cat the hashcodes and rehash the result. More CPU, less memory.
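
    The "hash each one, cat the hashcodes and rehash" variant could look roughly like this in Java. Using `ByteBuffer` to lay out the 4-byte hashcodes and `Arrays.hashCode` for the final pass is one choice among several, and the class `HashOfHashes` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the final hash function is an arbitrary choice.
class HashOfHashes {
    static int hash(List<String> strings) {
        // 4 bytes per string, regardless of how long the strings are.
        ByteBuffer buffer = ByteBuffer.allocate(4 * strings.size());
        for (String s : strings) {
            buffer.putInt(s.hashCode()); // "cat" the individual hashcodes
        }
        return Arrays.hashCode(buffer.array()); // rehash the concatenation
    }

    public static void main(String[] args) {
        System.out.println(hash(Arrays.asList("x", "yz")) == hash(Arrays.asList("xy", "z"))); // false
        System.out.println(hash(Arrays.asList("a", "b")) == hash(Arrays.asList("b", "a")));   // false
    }
}
```

    Because every string contributes a fixed-width block, no delimiter is needed to mark the boundaries.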
