Efficient algorithm to compare similarity between sets of numbers?

甜味超标 2021-02-01 11:13

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove every set that has 5 or more numbers in common (in any order) with any other set.


12 answers
  •  -上瘾入骨i
    2021-02-01 11:48

    There is a way to do this with high time efficiency but extremely low space efficiency.

    If my maths is correct, choosing 5 numbers from a set of 10 gives 10! / ((10-5)! × 5!) = 252 combinations per set; multiplied by 10000 sets, that is 2.52 million combinations. A set of 5 integers consumes 20 bytes, so you could put every combination for every set into a HashSet and use only about 50 megabytes of raw data (plus overhead, which will blow it out by 2-3 times at least).
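    The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes 4-byte ints and counts raw payload only, ignoring object and hash table overhead.

```java
// Illustrative helper, not part of the original answer.
final class ComboCount {
    private ComboCount() {}

    // Binomial coefficient C(n, k). After step i the running value equals
    // C(n - k + i, i), so every intermediate division is exact.
    static long choose(int n, int k) {
        if (k < 0 || k > n) return 0;
        k = Math.min(k, n - k);
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // Raw bytes needed to store every k-subset of every set,
    // assuming 4 bytes per int and no container overhead.
    static long rawBytes(int sets, int n, int k) {
        return sets * choose(n, k) * k * 4L;
    }
}
```

    With these definitions, `ComboCount.choose(10, 5)` gives 252 and `ComboCount.rawBytes(10000, 10, 5)` gives 50,400,000 bytes, i.e. roughly 50 MB before overhead.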

    Now that might seem expensive, but the alternative is to check each new set of 10 against the existing 10000 individually. If instead you calculate its 252 sets of 5 and see whether any of them is already in the HashSet, that has to be better.

    Basically:

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    public class SetOf5 {
      // per-instance state: each SetOf5 holds its own five numbers
      private final Set<Integer> numbers;
      private final int hashCode;

      public SetOf5(int... numbers) {
        if (numbers.length != 5) {
          throw new IllegalArgumentException();
        }
        Set<Integer> set = new HashSet<>();
        for (int i : numbers) {
          set.add(i);
        }
        this.numbers = Collections.unmodifiableSet(set);
        // Set.hashCode() sums the element hashes, so it ignores order
        this.hashCode = set.hashCode();
      }

      // other constructors for passing in, say, an array of 5 or a Collection of 5

      // this is precalculated because it will be called a lot
      @Override
      public int hashCode() {
        return hashCode;
      }

      @Override
      public boolean equals(Object ob) {
        if (!(ob instanceof SetOf5)) return false;
        SetOf5 setOf5 = (SetOf5) ob;
        // Set.equals compares sizes as well as membership
        return numbers.equals(setOf5.numbers);
      }
    }
    

    You then just have to do two things:

    1. Create a HashSet for all your existing tuples of 5; and
    2. Create an algorithm to create all the possible sets of 5.
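
    For step 2, one possible way to enumerate the sets of 5 is to walk an array of indices in lexicographic order. This is a minimal sketch; the `Combinations` class and its `of` method are hypothetical names, not something the answer defines.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of the original answer.
final class Combinations {
    private Combinations() {}

    // Returns every k-element subset of items, each as a new array that
    // keeps the elements in the same relative order as in items.
    static List<int[]> of(int[] items, int k) {
        List<int[]> out = new ArrayList<>();
        if (k < 0 || k > items.length) return out;
        int[] idx = new int[k];
        for (int i = 0; i < k; i++) idx[i] = i;
        while (true) {
            int[] combo = new int[k];
            for (int i = 0; i < k; i++) combo[i] = items[idx[i]];
            out.add(combo);
            // Find the rightmost index that can still move right.
            int pos = k - 1;
            while (pos >= 0 && idx[pos] == items.length - k + pos) pos--;
            if (pos < 0) return out;
            // Advance it and reset everything after it to the next slots.
            idx[pos]++;
            for (int i = pos + 1; i < k; i++) idx[i] = idx[i - 1] + 1;
        }
    }
}
```

    Calling `Combinations.of(tenNumbers, 5)` on an array of 10 values yields 252 arrays, each of which could be passed to the `SetOf5(int...)` constructor.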

    Your algorithm then becomes: for each set of 10 numbers, create all possible sets of 5 and check each one to see if it's in the HashSet. If any of them is, reject the set of 10. If none is, add all of its sets of 5 to the "set of sets" and continue with the next set of 10.
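
    Put together, the loop might look like the sketch below. To keep the example self-contained it uses a sorted `List<Integer>` as the hash key in place of `SetOf5`; that substitution, the `SimilarSetFilter` name, and the choice to check all 252 subsets before adding any are assumptions of this sketch. Note that it keeps the first set of any clashing group and rejects only the later ones; if both sides of a match must be removed, a second pass would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative helper, not part of the original answer.
final class SimilarSetFilter {
    private SimilarSetFilter() {}

    // Keeps a set only if it shares no 5-element subset with a set kept earlier.
    static List<int[]> filter(List<int[]> sets) {
        Set<List<Integer>> seen = new HashSet<>();
        List<int[]> kept = new ArrayList<>();
        for (int[] set : sets) {
            // Sorting makes each subset's key independent of input order.
            int[] sorted = set.clone();
            Arrays.sort(sorted);
            List<List<Integer>> keys = new ArrayList<>();
            collect(sorted, 0, new ArrayList<>(), keys);
            boolean clash = false;
            for (List<Integer> key : keys) {
                if (seen.contains(key)) {
                    clash = true;
                    break;
                }
            }
            // Only accepted sets contribute their subsets to the index.
            if (!clash) {
                seen.addAll(keys);
                kept.add(set);
            }
        }
        return kept;
    }

    // Recursively builds every 5-element subset of the sorted array.
    private static void collect(int[] sorted, int start,
                                List<Integer> current, List<List<Integer>> out) {
        if (current.size() == 5) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = start; i < sorted.length; i++) {
            current.add(sorted[i]);
            collect(sorted, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }
}
```

    Each accepted set costs 252 lookups and 252 insertions, regardless of how many sets have already been kept.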

    I think you'll find that'll be an awful lot cheaper--at least in the case of 5 numbers from 10--than any brute force comparison of 10000 sets with one another.
