Efficient algorithm to compare similarity between sets of numbers?

甜味超标 2021-02-01 11:13

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove every set that shares 5 or more numbers (in any order) with any other set.


12 answers
  •  余生分开走
    2021-02-01 11:47

    This is an easy problem because your sets are limited to ten elements. Every set of ten numbers has fewer than 1,000 subsets that contain at least five numbers. Select a hash function that maps integer sequences to, say, 32-bit numbers. For every set of ten integers, compute this hash for every subset with five or more elements, which gives fewer than 1,000 hash values per set. Add a pointer to the set to a hash table under each of these keys. With, for example, 10,000 sets, the hash table then holds at most 1,000 * 10,000 = 10 million entries, which is completely doable, and this first pass is linear (O(n)) because the individual set size is bounded by 10.
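
The "fewer than 1,000" bound can be checked by brute force. The sketch below is only an illustration (the function name count_large_subsets is made up for this example): it walks every bitmask of an n-element set and counts those with enough bits set.

```c
/* Count the subsets of an n-element set that have at least min_card
   elements, by enumerating every bitmask and counting its set bits.
   Hypothetical helper for illustration; assumes 0 <= n < 32 so that
   the shift below is well defined. */
unsigned int count_large_subsets(int n, int min_card)
{
    unsigned int count = 0;
    for (unsigned int mask = 0; mask < (1u << n); mask++) {
        int bits = 0;
        for (unsigned int m = mask; m != 0; m >>= 1) {
            bits += (int)(m & 1u);
        }
        if (bits >= min_card) {
            count++;
        }
    }
    return count;
}
```

Calling count_large_subsets(10, 5) returns 638, which is 252 + 210 + 120 + 45 + 10 + 1, the number of subsets of size 5 through 10.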

    In the next pass, iterate through all the hash values in any order. Whenever more than one set is associated with the same hash value, the sets most likely share a common subset of at least five integers (the alternative is a hash collision). Verify this, and then erase one of the sets together with its hash table entries. Continue through the hash table. This is also an O(n) step.
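
The "verify this" step can be a plain merge-style intersection count. This is a minimal sketch, assuming both sets are stored as ascending arrays of ten distinct integers; the names SET_SIZE, count_common and is_near_duplicate are invented for this example.

```c
#define SET_SIZE 10

/* Count the values two sets have in common. Both arrays are assumed
   to be sorted in ascending order with no repeated values. */
int count_common(const int *a, const int *b)
{
    int i = 0, j = 0, common = 0;
    while (i < SET_SIZE && j < SET_SIZE) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            common++;
            i++;
            j++;
        }
    }
    return common;
}

/* Nonzero when the two sets share five or more numbers. */
int is_near_duplicate(const int *a, const int *b)
{
    return count_common(a, b) >= 5;
}
```

Each check takes at most 20 comparisons, so running it on every pair of sets that land under the same hash value rules out false positives from hash collisions cheaply.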

    Finally, suppose that you are doing this in C. Here is a routine that would calculate the hash values for a single set of ten integers. It is assumed that the integers are in ascending order:

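    static int hash_index;

    /* Forward declaration: hrec is called before it is defined. */
    static void hrec(const int *myset, unsigned int *hash_values,
                     unsigned int h, int idx, int card);

    /* hash_values must have room for every subset of size >= 5. */
    void calculate_hash(const int *myset, unsigned int *hash_values)
    {
      hash_index = 0;
      hrec(myset, hash_values, 0, 0, 0);
    }

    /* h is the hash of the elements chosen so far, idx the next
       position to decide on, card the number of elements chosen. */
    static void hrec(const int *myset, unsigned int *hash_values,
                     unsigned int h, int idx, int card)
    {
      if (idx == 10) {
        if (card >= 5) {
          hash_values[hash_index++] = h;
        }
        return;
      }
      /* Mix myset[idx] into the hash (assumes 32-bit unsigned int). */
      unsigned int hp = h;
      hp += (myset[idx]) + 0xf0f0f0f0;
      hp += (hp << 13) | (hp >> 19);
      hp *= 0x7777;
      hp += (hp << 13) | (hp >> 19);
      hrec(myset, hash_values, hp, idx + 1, card + 1);  /* include myset[idx] */
      hrec(myset, hash_values, h,  idx + 1, card);      /* skip myset[idx] */
    }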
    

    This recurses through all 1,024 subsets and stores the hash values for subsets with cardinality 5 or more in the hash_values array. At the end, hash_index holds the number of valid entries. That number is a constant: for ten elements it is 638, the count of subsets with five or more elements, so hash_values needs room for 638 values.
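
For comparison, the same enumeration can be written as a loop over bitmasks, which avoids the global counter. This is a sketch rather than the answer's own code: it reuses the mixing steps of the recursive routine, assumes a 32-bit unsigned int, and the names mix and subset_hashes are invented here.

```c
#include <stddef.h>

#define SET_SIZE 10
#define MIN_CARD 5

/* Mix one element into a running hash, using the same steps as the
   recursive routine. Assumes unsigned int is 32 bits wide. */
static unsigned int mix(unsigned int h, int value)
{
    h += (unsigned int)value + 0xf0f0f0f0u;
    h += (h << 13) | (h >> 19);
    h *= 0x7777u;
    h += (h << 13) | (h >> 19);
    return h;
}

/* Write one hash per subset of size >= MIN_CARD into out and return
   how many were written. The caller provides room for 638 values. */
size_t subset_hashes(const int *set, unsigned int *out)
{
    size_t n = 0;
    for (unsigned int mask = 0; mask < (1u << SET_SIZE); mask++) {
        unsigned int h = 0;
        int card = 0;
        /* Fold the selected elements in ascending index order. */
        for (int idx = 0; idx < SET_SIZE; idx++) {
            if (mask & (1u << idx)) {
                h = mix(h, set[idx]);
                card++;
            }
        }
        if (card >= MIN_CARD) {
            out[n++] = h;
        }
    }
    return n;
}
```

Because each subset's hash depends only on its elements in ascending order, two sorted sets that share five or more numbers always produce at least one identical hash value. The values come out in a different order than in the recursive version, which does not matter for the hash table.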
