Efficient algorithm to compare similarity between sets of numbers?

甜味超标 2021-02-01 11:13

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove all sets that share 5 or more numbers (in any order) with any other set.


12 answers
  •  爱一瞬间的悲伤
    2021-02-01 12:03

    Maybe you need an algorithm like this (as I understand your problem)?

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Set;
    
    /**
     * @author karnokd, 2009.06.28.
     * @version $Revision 1.0$
     */
    public class NoOverlappingSets {
        // helper that builds a set from varargs (works around the
        // shortcomings of java type inference), O(N)
        public static Set<Integer> setOf(Integer... values) {
            return new HashSet<Integer>(Arrays.asList(values));
        }
        // the test function: true if the two sets share at least `limit` numbers, O(N)
        public static boolean isNumberOfDuplicatesAboveLimit(
                Set<Integer> first, Set<Integer> second, int limit) {
            int result = 0;
            for (Integer i : first) {
                if (second.contains(i)) {
                    result++;
                    if (result >= limit) {
                        return true;
                    }
                }
            }
            return false;
        }
        /**
         * @param args not used
         */
        public static void main(String[] args) {
            // sample input: the third set shares 5 numbers with the first one
            List<Set<Integer>> sets = new LinkedList<Set<Integer>>() {{
                add(setOf(12,14,222,998,1,89,43,22,7654,23));
                add(setOf(44,23,64,76,987,3,2345,443,431,88));
                add(setOf(998,22,7654,345,112,32,89,9842,31,23));
            }};
            List<Set<Integer>> resultset = new LinkedList<Set<Integer>>();
            loop:
            for (Set<Integer> curr : sets) {
                for (Set<Integer> existing : resultset) {
                    if (isNumberOfDuplicatesAboveLimit(curr, existing, 5)) {
                        continue loop;
                    }
                }
                // no overlapping with the previous instances
                resultset.add(curr);
            }
            System.out.println(resultset);
        }
    
    }
    

    I'm not an expert in Big O notation, but I think this algorithm is O(N*M^2), where N is the number of elements in each set and M is the total number of sets: each of the M sets is compared against up to M already accepted sets, and each comparison takes O(N). I took the liberty of defining what I consider overlapping sets: a set is dropped if it shares 5 or more numbers with a set that was already kept.
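
    If the M^2 pairwise comparisons become the bottleneck, one possible variation (not part of the code above, just a sketch) is to keep an inverted index from each number to the kept sets that contain it, so a new set is only counted against sets it actually shares a number with. The class name `IndexedOverlapFilter` and the methods `filter` and `setOf` are made up for this illustration, and the worst case is still no better than the pairwise version when many sets share numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: same greedy rule as NoOverlappingSets, but candidates
// are looked up through an inverted index instead of scanning every kept set.
class IndexedOverlapFilter {

    static Set<Integer> setOf(Integer... values) {
        return new LinkedHashSet<>(Arrays.asList(values));
    }

    // Keeps a set unless it shares `limit` or more numbers with an already kept set.
    static List<Set<Integer>> filter(List<Set<Integer>> sets, int limit) {
        List<Set<Integer>> kept = new ArrayList<>();
        // number -> positions (in `kept`) of the kept sets that contain it
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (Set<Integer> curr : sets) {
            // position in `kept` -> how many numbers of `curr` that set contains
            Map<Integer, Integer> shared = new HashMap<>();
            boolean overlaps = false;
            search:
            for (Integer n : curr) {
                for (Integer k : index.getOrDefault(n, Collections.emptyList())) {
                    if (shared.merge(k, 1, Integer::sum) >= limit) {
                        overlaps = true;
                        break search;
                    }
                }
            }
            if (!overlaps) {
                int id = kept.size();
                kept.add(curr);
                for (Integer n : curr) {
                    index.computeIfAbsent(n, x -> new ArrayList<>()).add(id);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Set<Integer>> sets = Arrays.asList(
            setOf(12, 14, 222, 998, 1, 89, 43, 22, 7654, 23),
            setOf(44, 23, 64, 76, 987, 3, 2345, 443, 431, 88),
            setOf(998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23));
        // The third set shares 998, 22, 7654, 89 and 23 with the first, so it is dropped.
        System.out.println(filter(sets, 5).size()); // prints 2
    }
}
```

    As in the original code, the result depends on input order: the first set of an overlapping pair is kept and the later one is dropped.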

    I think your problem is polynomial. As I remember from my lectures, the decision-based version would be NP-hard, but correct me if I'm wrong.
