Quickest algorithm for finding sets with high intersection

backend · open · 4 answers · 494 views
予麋鹿 2021-02-01 06:37

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups. The groups are small, on the order of 20 users each, and the vast majority of pairs of groups have an empty intersection. I want to find all pairs of groups that have at least 15 users in common. What is the quickest algorithm for this?

4 Answers
  • 2021-02-01 07:09

    I would do exactly what you propose: map users to their groups. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:

    foreach group:
      map = new Map<Group, int>  // maps groups to count
      foreach user in group:
        foreach userGroup in user.groups:
          map[userGroup]++
          if( map[userGroup] == 15 && userGroup.id > group.id )
            largeIntersection( group, userGroup )
    

    Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O( G*U*g ). Which, given your problem, is probably much faster than the naive pairwise comparison of groups which runs in O(G*G*U).
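    The pseudocode above translates almost directly to Python. This is a sketch rather than the answerer's exact code: the names (`find_large_intersections`, `user_groups`) and the dict-of-sets input format are my own assumptions.

```python
from collections import defaultdict

THRESHOLD = 15  # the intersection size of interest

def find_large_intersections(groups):
    """groups: dict of group id -> set of user ids (hypothetical input format).
    Returns each pair of group ids sharing at least THRESHOLD users, once."""
    # Invert the membership relation: user id -> list of group ids.
    user_groups = defaultdict(list)
    for gid, members in groups.items():
        for user in members:
            user_groups[user].append(gid)

    results = []
    for gid, members in groups.items():
        counts = defaultdict(int)  # other group id -> shared users seen so far
        for user in members:
            for other in user_groups[user]:
                counts[other] += 1
                # Fire exactly once, when the count first reaches the threshold;
                # the id comparison reports each unordered pair only one way.
                if counts[other] == THRESHOLD and other > gid:
                    results.append((gid, other))
    return results
```

    For example, groups `{1: set(range(20)), 2: set(range(5, 25))}` share the 15 users 5..19, so the pair (1, 2) is reported once.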

  • 2021-02-01 07:18

    If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:

    • Throw away all sets of size <15 before you start
    • Calculate your lookup from userid -> list of sets to which it belongs
    • Create a map<pair<userset, userset>, int>
    • For each user, increment (after creating if necessary), n*(n-1)/2 entries of that map, where n is the number of sets to which the user belongs.
    • When that's finished, scan the map for entries where the value is at least 15.

    It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.

    It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.

    Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map for each thread, and periodically give one thread a new, empty, map and add the results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.
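    A minimal single-threaded sketch of the scheme above (the sharding is omitted); the function name and the `user id -> list of set ids` input shape are assumptions of mine:

```python
from collections import defaultdict
from itertools import combinations

THRESHOLD = 15

def pairs_above_threshold(user_to_sets):
    """user_to_sets: dict of user id -> ids of the sets the user belongs to
    (sets of size < THRESHOLD are assumed to have been discarded already)."""
    pair_counts = defaultdict(int)  # (set id, set id) -> shared-user count
    for set_ids in user_to_sets.values():
        # n*(n-1)/2 increments per user, one per pair of sets sharing this user.
        for a, b in combinations(sorted(set_ids), 2):
            pair_counts[(a, b)] += 1
    return [pair for pair, n in pair_counts.items() if n >= THRESHOLD]
```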

  • 2021-02-01 07:19

    One way is to treat your problem as a radius search in a metric space, where the distance function is the number of non-matching entries and the radius is r = max(number of elements in the sets) - required number of equal elements. The found elements still need to be filtered to check that enough values really are shared. So unless someone comes up with a metric function that can be used directly, this solution has too many constraints.

    One of the data structures for metric search is the BK-tree, which can be used for string similarity searches.

    Candidates for your problem are VP-trees and M-trees.

    The worst case for a metric tree is O(n^2) when you search for a distance > m (the maximum number of elements in the sets), as you build the tree in O(n log n) and search in O(n^2).

    Apart from that, the actual runtime depends on the ability to prune subtrees of the metric tree while executing the search. In a metric tree a subtree can be skipped if the distance from the pivot element to the search element exceeds the search radius plus the covering radius of the pivot element (which is at least the maximum distance of any descendant to the pivot element). If your entry sets are largely disjoint, the overall runtime will be dominated by the build-up time of the metric tree, O(n log n).
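    One concrete distance that fits this description, and is a genuine metric, is the size of the symmetric difference, d(A, B) = |A| + |B| - 2|A ∩ B|. For two 20-element sets, an intersection of at least 15 is then exactly a distance of at most 10. (The choice of this particular metric is my assumption, not the answerer's.)

```python
def set_distance(a, b):
    """Symmetric-difference cardinality: a true metric on finite sets."""
    return len(a ^ b)  # == len(a) + len(b) - 2 * len(a & b)

a = set(range(20))
b = set(range(5, 25))  # shares exactly 15 elements with a
assert len(a & b) == 15
assert set_distance(a, b) == 10  # intersection >= 15  <=>  distance <= 10
```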

  • 2021-02-01 07:26

    Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.)

    In order to count the degree of intersection, you still need to visit the other groups each user is in, which is still cubic. You could use a hash table or other sparse array for the counts, but at best it still requires an increment for every user for every pair of groups that user is in. If you have N users in G groups, with an average of S users per group and each user in an average of T groups, then comparing each pair of groups costs G*G*S/2, while the user-to-group index costs N*T*T. Since T = G*S/N, we get N*T*T = G*G*S*S/N; for S = 20 and N in the millions, the index should have an advantage. Unfortunately, you also need at least G*G storage for the intersection counts (25 TB or so for a 4-bit non-sparse counter), and you have to ensure the structure can be incremented in parallel.

    For a million users in 10 million groups of 20, very approximately the probability that a user is in a given group is 20/10^6 = 2e-5, and the probability that two groups share at least one user is about 20 * 2e-5 = 400e-6, so the 25 TB comes down to roughly 10 GB of data, which is not impossible for a sparse array on an ordinary sized computer.

    However, comparing sets of 20 elements for 15 elements in common has more obvious optimisations:

    • If the groups are sorted, you require no working storage: just output the degree of difference between the input groups directly.
    • Most of the memory accesses will be linear in contiguous memory areas, and the results are dependent only on the two sets being compared, rather than relying on summing across the whole data set. Accessing main memory randomly is significantly slower than accessing it linearly. Altering main memory randomly with bus locks is orders of magnitude slower than accessing cache without the need to lock the bus (though if you have a couple of GB per core you can use the user->group approach without having to do any synchronisation).
    • You only need to count the elements that are distinct between the sets: a 20-element set can miss at most 5 elements of the other while still sharing 15, so you can stop as soon as a sixth mismatch appears. If the data is random then most sets will be disjoint, and the average number of elements visited is small.
    • You can quickly discount certain groups by treating the difference as a distance (if A is 11 different from B, and C is 5 different from B, then by the triangle inequality C is between 6 and 16 different from A, so it can be discounted without comparing A and C directly). Since most of the sets are completely disjoint, this won't gain you much though.
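    The first three points combine into a single merge-style scan over two sorted groups. This is a sketch under my own naming; the early exit implements the "stop at the sixth mismatch" observation for 20-element groups and k = 15:

```python
def common_at_least(a, b, k=15):
    """a, b: sorted lists of user ids. True iff they share at least k elements."""
    max_misses_a = len(a) - k  # elements of a allowed to be absent from b
    max_misses_b = len(b) - k
    i = j = common = misses_a = misses_b = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            misses_a += 1
            if misses_a > max_misses_a:
                return False  # too many of a's elements are missing from b
            i += 1
        else:
            misses_b += 1
            if misses_b > max_misses_b:
                return False
            j += 1
    return common >= k
```

    Both inputs are read linearly, which matches the point about contiguous memory access above.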

    There is also the option of a hybrid approach, using the user->group map to prune the set of group-to-group comparisons to be made. This has the advantage of not requiring increments to a shared data structure:

    • for each pair of groups that a user is in, add that pair to the list to investigate.
    • sort the list of pairs of groups with at least one user in common.
    • the number of times each pair occurs in the list is the number of users they have in common.

    Using merge sort, this is very easy to parallelise into pure streaming units. You would be sorting around 20*200*10 million/2 = 20 billion group ID pairs (each of the 10 million groups has 20 users, each user is in about 200 groups, and each unordered pair is counted once).
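    A compact sketch of the three bullet-point steps, with an in-memory sort and `groupby` standing in for the external streaming merge sort; the names are my own:

```python
from itertools import combinations, groupby

def hybrid_pairs(user_to_groups, k=15):
    """Emit one (g1, g2) record per user per pair of that user's groups;
    after sorting, a pair's multiplicity equals the groups' intersection size."""
    records = []
    for group_ids in user_to_groups.values():
        records.extend(combinations(sorted(group_ids), 2))
    records.sort()  # this is the step a streaming merge sort parallelises
    return [pair for pair, run in groupby(records) if sum(1 for _ in run) >= k]
```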
