efficient algorithm to compare similarity between sets of numbers?

甜味超标 2021-02-01 11:13

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove all sets that have 5 or more numbers in common (in any order) with any other set.

F

12 Answers
  • 2021-02-01 11:56

    Another perfect job for a Signature Tree. Once again I'm aghast that there isn't a library out there that implements them. Let me know if you write one.

    From the abstract of the first paper in the search results above:

    We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size and efficient for various queries.

  • 2021-02-01 12:02

    You should find the Pearson Coefficient between two sets of data. This method will make your program easily scalable to huge data sets.

  • 2021-02-01 12:03

    I don't think there's a nice and beautiful way to do it. Most other answers will have you make a comparison between most pairs x,y which would be O(N^2). You can do it faster.

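    Algorithm: keep an array of all 5-tuples. For each new set, split it into all possible 5-tuples (each written in sorted order, so that equal subsets compare equal) and add them to that array. At the end, sort the array and check for duplicates: any 5-tuple that appears twice comes from two sets sharing at least 5 numbers.

    There are C(10, 5) = 10*9*8*7*6/120 = 9*4*7 = 252 subsets of size 5 in a set of size 10. So you keep a table with about 250 entries per set, on the order of 10^2 times larger than your data, but perform only O(250*N) insertions before the final sort. That should work in practice, and I suspect that's the best possible theoretically as well.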
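
    The idea above can be sketched in Python. This is only an illustration, not the answerer's code: it swaps the sort-and-scan step for a hash table lookup, and the name find_overlapping and the sample data (borrowed from the Java answer in this thread) are assumptions made for the example.

```python
from itertools import combinations


def find_overlapping(sets, k=5):
    """Return indices of sets sharing at least k numbers with another set."""
    seen = {}        # sorted k-subset -> index of the first set containing it
    flagged = set()  # indices of sets that overlap with some other set
    for idx, s in enumerate(sets):
        # Sorting first makes equal subsets produce identical tuples.
        for combo in combinations(sorted(s), k):
            first = seen.setdefault(combo, idx)
            if first != idx:
                flagged.update((first, idx))
    return flagged


sets = [
    {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23},
    {44, 23, 64, 76, 987, 3, 2345, 443, 431, 88},
    {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23},
]
print(sorted(find_overlapping(sets)))  # [0, 2]
```

    The first and third sets share 998, 22, 7654, 89 and 23, so both are flagged. Memory grows with the number of distinct 5-subsets, which is the space cost discussed above.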

  • 2021-02-01 12:03

    Maybe you need an algorithm like this (as I understand your problem)?

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Set;
    
    /**
     * @author karnokd, 2009.06.28.
     * @version $Revision 1.0$
     */
    public class NoOverlappingSets {
        // because of the shortcomings of java type inference, O(N)
        public static Set<Integer> setOf(Integer... values) {
            return new HashSet<Integer>(Arrays.asList(values));
        }
        // the test function, O(N)
        public static boolean isNumberOfDuplicatesAboveLimit(
                Set<Integer> first, Set<Integer> second, int limit) {
            int result = 0;
            for (Integer i : first) {
                if (second.contains(i)) {
                    result++;
                    if (result >= limit) {
                        return true;
                    }
                }
            }
            return false;
        }
        /**
         * Keeps each set only if it shares fewer than 5 numbers with every set kept so far.
         */
        public static void main(String[] args) {
            List<Set<Integer>> sets = new LinkedList<Set<Integer>>() {{
                add(setOf(12,14,222,998,1,89,43,22,7654,23));
                add(setOf(44,23,64,76,987,3,2345,443,431,88));
                add(setOf(998,22,7654,345,112,32,89,9842,31,23));
            }};
            List<Set<Integer>> resultset = new LinkedList<Set<Integer>>();
            loop:
            for (Set<Integer> curr : sets) {
                for (Set<Integer> existing : resultset) {
                    if (isNumberOfDuplicatesAboveLimit(curr, existing, 5)) {
                        continue loop;
                    }
                }
                // no overlapping with the previous instances
                resultset.add(curr);
            }
            System.out.println(resultset);
        }
    
    }
    

    I'm not an expert in Big O notation but I think this algorithm is O(N*M^2) where N is the number of elements in the set and M is the total number of sets (based on the number of loops I used in the algorithm). I took the liberty of defining what I consider overlapping sets.

    I think your problem is polynomial. As I remember from my lectures, the decision-based version would be NP-hard, but correct me if I'm wrong.

  • 2021-02-01 12:05

    Let's assume you have a class NumberSet which implements your unordered set (and can enumerate ints to get the numbers). You then need the following data structures and algorithm:

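    • Map<Integer, Set<NumberSet>> numberSets, mapping each number to the sets that contain it
    • Map<Pair<NumberSet, NumberSet>, Integer> matchCount, mapping each pair of sets to the number of values they share
    • Pair<X,Y> is a key object which returns the same hashcode and equality for each instance with the same X and Y (no matter if they are swapped)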

    Now for each set to be added/compared do the following (pseudocode!!!):

    for (int number : setToAdd) {
       Set<NumberSet> numbers = numberSets.get(number);
       if (numbers == null) {
          numbers = new HashSet<NumberSet>();
          numberSets.put(number, numbers);
       } else {
          // every set already indexed under this number shares it with setToAdd
          for (NumberSet numberSet : numbers) {
             Pair<NumberSet, NumberSet> pairKey = new Pair<NumberSet, NumberSet>(numberSet, setToAdd);
             matchCount.merge(pairKey, 1, Integer::sum); // a missing key counts as 0
          }
       }
       numbers.add(setToAdd); // index the set itself, not the number
    }
    

    At any time you can go through the pairs; each pair with a count of 5 or greater is a duplicate.

    Note: removing the sets may be a bad idea, because if A is considered a duplicate of B, and B a duplicate of C, C does not have to be a duplicate of A. So if you remove B, you would not remove C, and the order in which you add your sets would become important.
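
    A minimal Python sketch of this inverted-index idea follows. It is an illustration under assumptions, not the answerer's code: sets are identified by their list index instead of a NumberSet class, pairs are stored as (i, j) tuples with i < j instead of a Pair key object, and the function name count_shared is made up for the example.

```python
from collections import defaultdict


def count_shared(sets):
    """Map each pair of set indices (i, j), i < j, to how many numbers they share."""
    index = defaultdict(list)       # number -> indices of earlier sets containing it
    match_count = defaultdict(int)  # (i, j) -> count of shared numbers
    for j, s in enumerate(sets):
        for number in s:
            # Every earlier set indexed under this number shares it with set j.
            for i in index[number]:
                match_count[(i, j)] += 1
            index[number].append(j)
    return match_count


sets = [
    {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23},
    {44, 23, 64, 76, 987, 3, 2345, 443, 431, 88},
    {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23},
]
counts = count_shared(sets)
duplicates = {pair for pair, c in counts.items() if c >= 5}
print(duplicates)  # {(0, 2)}
```

    The work is proportional to the number of (pair, shared number) incidences, so it is fast when most numbers are rare and degrades toward the all-pairs cost when a few numbers appear in most sets.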

  • 2021-02-01 12:10

    We will take the data set, adorn each element with a signature, and sort it. The signature has the property that sorting groups together those elements which could have duplicates. When comparing data_set[i] to the items in data_set[i+1 ...], as soon as the signature duplicate check fails for an item in [i+1 ...], we advance i. This "adjacency criterion" assures we don't have to look further; no element beyond this can be a duplicate.

    This reduces the O(N^2) comparison a great deal. How much I'll let an algorithm analyst decide, but the code below does ~400k comparisons instead of the 100m of a naive O(N^2).

    The signature starts by bucketing the elements. We divide the range of the numbers into N equal-sized buckets: 1..k, k+1..2k, 2k+1..3k, ... When iterating over the numbers of an element, we increment a bucket's count each time a number falls into that bucket. This yields an initial signature of the form (0,0,0,1,3,0,0,...4,2).

    The signature has the property that if

    sum(min(sig_a[i], sig_b[i]) for i in range(10)) >= 5
    

    then it is possible that the elements associated with the signatures have at least 5 duplicates. More importantly, if the above does not hold, then the elements cannot have 5 duplicates. Let's call this the "signature match criterion".
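
    To make the criterion concrete, here is a small Python sketch. It is an illustration only: the helper names bucket_signature and may_share, the value range 0..99, and the three sample sets are assumptions chosen for the example, not part of the answer's code.

```python
def bucket_signature(element, vmin, vmax, buckets):
    """Count how many numbers of `element` fall into each equal-width bucket."""
    width = -(-(vmax - vmin + 1) // buckets)  # ceiling division
    sig = [0] * buckets
    for value in element:
        sig[(value - vmin) // width] += 1
    return sig


def may_share(sig_a, sig_b, k=5):
    """Signature match criterion: False means fewer than k common numbers."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b)) >= k


a = {3, 8, 15, 22, 37, 41, 56, 63, 78, 91}
b = {3, 8, 15, 22, 37, 44, 59, 66, 71, 99}  # shares 5 numbers with a
c = {1, 2, 4, 5, 6, 7, 9, 10, 11, 12}       # shares nothing with a
sig_a, sig_b, sig_c = (bucket_signature(s, 0, 99, 10) for s in (a, b, c))

print(sig_a)                    # [2, 1, 1, 1, 1, 1, 1, 1, 0, 1]
print(may_share(sig_a, sig_b))  # True: a full comparison is still needed
print(may_share(sig_a, sig_c))  # False: a and c cannot share 5 numbers
```

    The test is one-sided because every shared number lands in the same bucket in both sets, so the sum of per-bucket minimums can only overestimate the true overlap, never underestimate it.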

    But, sorting by the above signature does not have the adjacency property mentioned above. However, if we modify the signature to be of the two element form:

    (sum(sig[:-1]), sig[-1])
    

    then the "signature match criterion" holds. But does the adjacency criterion hold? Yes. The sum of that signature is 10. If we enumerate, we have the following possible signatures:

    (0,10) (1, 9) (2, 8) (3, 7) (4, 6) (5, 5) (6, 4) (7, 3) (8, 2) (9, 1) (10,0)
    

    If we compare (0,10) against (1,9) .. (10,0), we note that once the signature test fails it never again becomes true. The adjacency criterion holds. Furthermore, that adjacency criterion holds for all positive values, not just "5".
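
    The claim for the two-element signatures can be checked by brute force. The following Python sketch is an illustration with made-up helper names (overlap_bound, passes_then_fails); it only covers the eleven two-element signatures listed above, not the nested signature introduced afterwards.

```python
def overlap_bound(sig_a, sig_b):
    """Upper bound on shared numbers implied by two bucket signatures."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))


# (0,10), (1,9), ..., (10,0) in sorted order
signatures = [(i, 10 - i) for i in range(11)]


def passes_then_fails(threshold):
    """Check that, scanning forward from any signature, the test never recovers."""
    for i, sig_a in enumerate(signatures):
        results = [overlap_bound(sig_a, sig_b) >= threshold
                   for sig_b in signatures[i + 1:]]
        # All True values must come before all False values.
        if results != sorted(results, reverse=True):
            return False
    return True


print(all(passes_then_fails(t) for t in range(1, 11)))  # True
```

    For signatures (a, 10-a) and (b, 10-b) with b > a the bound equals 10 - (b - a), which only decreases as b grows; that is why the test cannot become true again after failing.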

    Ok, that's nice, but splitting the signature into two large buckets won't necessarily reduce the O(N^2) search; the signature is overly general. We solve that problem by creating a signature for sig[:-1], producing

    (sum(sig[:-1]), sig[-1]), (sum(sig[:-2]), sig[-2]), ...
    

    and so on. I believe this signature still satisfies adjacency, but I could be wrong.

    There are some optimizations I didn't do: the signature only needs the last value of each tuple, not the first, but the sorting step would have to be revised. Also, the signature comparison could be optimized with an early fail when it becomes clear that further scanning cannot succeed.

    # python 3.0
    import random
    
    # M number of elements, N size of each element
    M = 10000
    N = 10
    
    # Bounds on the size of an element of each set
    Vmin,Vmax = 0, (1 << 12)
    
    # DupCount is number of identical numbers required for a duplicate
    DupCount = 5
    
    # R random number generator, same sequence each time through
    R = random.Random()
    R.seed(42)
    
    # Create a data set of roughly the correct size
    data_set = [list(s) for s in (set(R.randint(Vmin, Vmax) for n in range(N)) for m in range(M)) if len(s) == N]
    
    # Adorn the data_set signatures and sort
    def signature(element, width, n):
        "Return a signature for the element"
        def pearl(l, s):
            # Build the nested (sum of remaining buckets, last bucket) pairs
            def accrete(l, s, last, out):
                if last == 0:
                    return out
                r = l[last]
                return accrete(l, s-r, last-1, out+[(s-r,r)])
            return accrete(l, s, len(l)-1, [])
        # Bucket counts: l[b] is how many numbers fall into bucket b
        l = (n+1) * [0]
        for i in element:
            l[i // width] += 1
        return pearl(l, len(element))
    
    # O(n lg(n)) - with only 10k elements, lg(n) is a little over 13
    adorned_data_set = sorted([signature(element, (Vmax-Vmin+1)//12, 12), element] for element in data_set)
    
    # Count the number of possible intersections
    def compare_signatures(sig_a, sig_b, n=DupCount):
        "Return true if the signatures are compatible"
        for ((head_a, tail_a), (head_b, tail_b)) in zip(sig_a, sig_b):
            n -= min(tail_a, tail_b)
            if n <= 0:
                return True
        return False
    
    k = n = 0
    for i, (sig_a, element_a) in enumerate(adorned_data_set):
        if not element_a:
            continue
        for j in range(i+1, len(adorned_data_set)):
            sig_b, element_b = adorned_data_set[j]
            if not element_b:
                continue
            k += 1
            if compare_signatures(sig_a, sig_b):
                # here element_a and element_b would be compared for equality
                # and the duplicate removed by  adorned_data_set[j][1] = []
                n += 1
            else:
                break
    
    print("maximum of %d out of %d comparisons required" % (n,k))
    