Quickest algorithm for finding sets with high intersection

予麋鹿 2021-02-01 06:37

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million

4 answers
  •  醉酒成梦
    2021-02-01 07:18

    If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:

    • Throw away all sets of size <15 before you start
    • Calculate your lookup from userid -> list of sets to which it belongs
    • Create a map from pairs of sets to a count: map<pair<set, set>, int>
    • For each user, increment (creating it first if necessary) the map entry for every pair of sets the user belongs to. That is n*(n-1)/2 entries, where n is the number of sets to which the user belongs.
    • When that's finished, scan the map for entries where the value is at least 15, the same cut-off used to discard small sets in the first step.
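The steps above can be sketched in Python as follows. This is only a minimal illustration, not a tuned implementation: the function name `high_overlap_pairs`, the input shape `sets_by_id` (a dict from set id to a Python `set` of user ids) and the inclusive `threshold` parameter are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def high_overlap_pairs(sets_by_id, threshold=15):
    """Return {(a, b): shared_users} for pairs of sets sharing >= threshold users.

    sets_by_id is a hypothetical input shape: {set_id: set_of_user_ids}.
    """
    # Step 1: a set smaller than the threshold can never reach it.
    kept = {sid: users for sid, users in sets_by_id.items()
            if len(users) >= threshold}

    # Step 2: invert to user id -> list of set ids containing that user.
    sets_of_user = defaultdict(list)
    for sid, users in kept.items():
        for user in users:
            sets_of_user[user].append(sid)

    # Steps 3-4: every user adds 1 to each pair of sets it belongs to,
    # which is n*(n-1)/2 increments for a user in n sets.
    pair_counts = defaultdict(int)
    for sids in sets_of_user.values():
        for pair in combinations(sorted(sids), 2):
            pair_counts[pair] += 1

    # Step 5: keep only the pairs that reach the threshold.
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}
```

Sorting the set ids before pairing means each pair is always stored as `(smaller_id, larger_id)`, so the same two sets never end up under two different keys.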

    It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries (assuming about 10 million sets, that is 10M × 10 / 2, since each pair is stored once), which is starting to be a lot of RAM. It's also woefully cache-unfriendly.
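The 50M figure can be reproduced with a couple of lines of arithmetic. The helper names below are hypothetical, and the 16 bytes per entry is purely an assumed illustration (two 4-byte set ids, a 4-byte count and some padding); real hash maps add per-entry overhead on top of that.

```python
def estimated_map_entries(num_sets, avg_partners):
    # Each intersecting pair is shared by two sets, so halve the product.
    return num_sets * avg_partners // 2


def estimated_bytes(entries, bytes_per_entry):
    # Raw payload only; container overhead is deliberately ignored.
    return entries * bytes_per_entry


entries = estimated_map_entries(10_000_000, 10)
payload = estimated_bytes(entries, 16)  # 16 bytes/entry is an assumption
```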

    It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.

    Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard it into one map per thread, and periodically hand a thread a new, empty map while its results so far are added into the total results. The different threads then run completely independently most of the time, each given a list of users to process.
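A rough sketch of that sharding idea, again in Python. The names `count_pairs_sharded`, `sets_of_user` (a dict from user id to the list of set ids containing that user), `workers` and `flush_at` are assumptions for illustration. Each worker fills a private map and only takes the shared lock when it flushes that map into the total.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_pairs_sharded(sets_of_user, workers=4, flush_at=100_000):
    """Count shared users per pair of sets, using one private map per worker."""
    total = Counter()
    total_lock = threading.Lock()
    users = list(sets_of_user)
    # Give each worker its own slice of the users.
    chunks = [users[i::workers] for i in range(workers)]

    def flush(local):
        # The only point where workers contend with each other.
        with total_lock:
            total.update(local)  # Counter.update adds counts together

    def work(chunk):
        local = Counter()
        for user in chunk:
            for pair in combinations(sorted(sets_of_user[user]), 2):
                local[pair] += 1
            if len(local) >= flush_at:
                flush(local)
                local = Counter()  # start again with a new, empty map
        flush(local)  # merge whatever is left at the end

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chunks))  # consume results to surface exceptions
    return total
```

In CPython the global interpreter lock means these threads will not actually speed up a pure-Python loop, so this only shows the shape of the scheme. The same structure carries over to a language with real parallel threads, or to processes that each return their local map.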
