I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups, each containing roughly 20 users. I want to find every pair of groups that has 15 or more users in common.
I would do exactly what you propose: map users to their groups. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:
foreach group:
    map = new Map<Group, int> // maps groups to count
    foreach user in group:
        foreach userGroup in user.groups:
            map[userGroup]++
            if( map[userGroup] == 15 && userGroup.id > group.id )
                largeIntersection( group, userGroup )
Given you have G groups each containing U users on average, and given that each of these users belongs to g groups on average, this will run in O(G*U*g). Which, given your problem, is probably much faster than the naive pairwise comparison of groups, which runs in O(G*G*U).
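A minimal Java sketch of that algorithm, assuming user IDs and group IDs are dense integers starting at 0 and that the user-to-group index has already been built; the array layout, the names and the println reporting are placeholders, not part of the original answer:

import java.util.*;

class GroupIntersections {
    // groups[g]     = array of user IDs in group g
    // userGroups[u] = array of group IDs that user u belongs to
    // Reports every pair of groups sharing at least 'threshold' users.
    static void findLargeIntersections(int[][] groups, int[][] userGroups, int threshold) {
        for (int g = 0; g < groups.length; g++) {
            Map<Integer, Integer> counts = new HashMap<>(); // other group -> shared users so far
            for (int user : groups[g]) {
                for (int other : userGroups[user]) {
                    if (other <= g) continue;               // report each unordered pair only once
                    int c = counts.merge(other, 1, Integer::sum);
                    if (c == threshold) {
                        System.out.println("groups " + g + " and " + other
                                + " share at least " + threshold + " users");
                    }
                }
            }
        }
    }
}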
If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:
Keep a map<pair<userset, userset>, int> of intersection counts. For each user, increment the n*(n-1)/2 entries of that map corresponding to the pairs of sets that user belongs to, where n is the number of sets to which the user belongs. It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.
It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.
Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map per thread, and periodically give one thread a new, empty map and merge its results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.
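A sketch of that pair-counting idea in Java, under the same assumptions as above; since Java has no built-in pair type, the two group IDs are packed into a single long to key an ordinary HashMap (for the parallel version you would swap in one such map per thread, as described):

import java.util.*;

class PairCounts {
    // userGroups[u] = sorted array of group IDs containing user u.
    // Returns a map from a packed (groupA, groupB) pair (groupA < groupB) to shared-user count.
    static Map<Long, Integer> countPairIntersections(int[][] userGroups) {
        Map<Long, Integer> counts = new HashMap<>();
        for (int[] groups : userGroups) {
            // each user bumps the count of n*(n-1)/2 pairs, n = number of groups the user is in
            for (int i = 0; i < groups.length; i++) {
                for (int j = i + 1; j < groups.length; j++) {
                    long key = ((long) groups[i] << 32) | (groups[j] & 0xFFFFFFFFL);
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts; // entries with a value of 15 or more are the pairs you're after
    }
}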
One way is to see your problem as a radius search in a metric space, where the distance function is the number of non-matching entries and the radius is r = max(number of elements in the sets) - required number of equal elements. The elements found within that radius still need to be filtered to check that enough values are actually in common, so unless someone comes up with a metric function that can be used directly, this solution has too many constraints.
One of the data structures for metric search is the BK-tree, which can be used for string similarity searches. Candidates for your problem are the VP-tree and the M-tree.
The worst case for a metric tree is O(n^2), when you're searching for a distance > m (the maximum number of elements in the sets), since you build the tree in O(n log n) and search in O(n^2).

Apart from that, the actual runtime complexity depends on the ability to prune subtrees of the metric tree while executing the search. In a metric tree, a subtree can be skipped if the distance of the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance of an ancestor to the pivot element). If your entry sets are rather disjoint, the overall runtime will be dominated by the build-up time of the metric tree, O(n log n).
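For reference, a sketch of the kind of distance function such a tree would be built over: the number of non-matching entries between two sorted sets, i.e. the size of the symmetric difference, which is a genuine metric. The surrounding VP-tree/M-tree machinery is whatever implementation you pick and is not shown here:

class SetDistance {
    // Distance between two sorted sets of user IDs: the number of non-matching entries,
    // i.e. the size of the symmetric difference (a proper metric on sets).
    static int distance(int[] a, int[] b) {
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { common++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return (a.length - common) + (b.length - common);
    }
}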
Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.)
In order to count the degree of intersection, you still need to visit the other groups the user belongs to, which is still cubic. You could have a hash table or other sparse array to count, but it would still at best require an increment for each user for each pair of groups that user is in. If you have N users in G groups, with an average of S users per group and each user in T groups, you've got G*G*S/2 for comparing each pair of groups and N*T*T if you have an index of user to group. T = G*S/N, so N*T*T = G*G*S*S/N; for S = 20 and N in the millions there should be an advantage. Unfortunately, you also need at least G*G storage for the intersection counts (25 TB or so for a 4-bit non-sparse counter), and have to ensure the structure can be incremented in parallel.
For a million users in 10 million groups of 20, very approximately the probability that a user is in a given group is 2e-6, and the probability that two groups will share users is 40e-6, so the 25 TB comes down to about 1 GB of data, which is not impossible for a sparse array on an ordinary-sized computer.
However, comparing sets of 20 elements for 15 elements in common has more obvious optimisations.
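One such optimisation (this is my reading of the remark, not something spelled out above) is to merge the two sorted sets and bail out as soon as 15 common elements can no longer be reached:

class IntersectionCheck {
    // Returns true if the sorted sets a and b have at least 'threshold' elements in common,
    // giving up as soon as the threshold can no longer be reached.
    static boolean intersectsAtLeast(int[] a, int[] b, int threshold) {
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            int remaining = Math.min(a.length - i, b.length - j);
            if (common + remaining < threshold) return false; // 15 is out of reach: stop early
            if (a[i] == b[j]) { common++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return common >= threshold;
    }
}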
There is also the option of a hybrid approach, using the user->group map to prune the set of group-to-group comparisons to be made. This has the advantage of not requiring incrementing of a shared data structure: for each user, emit every pair of groups that user belongs to, sort the resulting list of group ID pairs, and then count repeats; any pair that occurs at least 15 times is a pair of groups with a large intersection.
Using merge sort, this is very easy to parallelise into pure streaming units. You would be sorting around 20*200*10 million/2 = 20 billion group ID pairs (each group of 20 users, times the number of groups each user is in, divided by 2).
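A single-machine sketch of that pipeline, with group IDs again packed into longs so the pairs can be sorted as a flat array; at the 20-billion-pair scale you would of course replace Arrays.sort with an external or distributed merge sort rather than one in-memory array:

import java.util.Arrays;

class HybridPairs {
    // userGroups[u] = sorted array of group IDs containing user u.
    // Emits every pair of groups that share a user, sorts the pairs,
    // and reports any pair occurring at least 'threshold' times.
    static void report(int[][] userGroups, int threshold) {
        long total = 0;
        for (int[] g : userGroups) total += (long) g.length * (g.length - 1) / 2;
        long[] pairs = new long[(int) total];    // the sketch assumes the pairs fit in one array
        int k = 0;
        for (int[] g : userGroups)
            for (int i = 0; i < g.length; i++)
                for (int j = i + 1; j < g.length; j++)
                    pairs[k++] = ((long) g[i] << 32) | (g[j] & 0xFFFFFFFFL);
        Arrays.sort(pairs);                      // stand-in for a parallel / external merge sort
        for (int i = 0; i < pairs.length; ) {
            int j = i;
            while (j < pairs.length && pairs[j] == pairs[i]) j++;  // run of identical pairs
            if (j - i >= threshold)
                System.out.println("groups " + (pairs[i] >>> 32) + " and " + (int) pairs[i]
                        + " share " + (j - i) + " users");
            i = j;
        }
    }
}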