I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million
I would do exactly what you propose: map each user to their groups. That is, I would keep a list of group IDs for every user. Then I would use the following algorithm:
    foreach group:
        map = new Map // maps groups to count
        foreach user in group:
            foreach userGroup in user.groups:
                map[userGroup]++
                if( map[userGroup] == 15 && userGroup.id > group.id )
                    largeIntersection( group, userGroup )
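
The pseudocode above could be sketched in Python roughly as follows. This is only a minimal illustration under some assumptions: groups are given as a dict from group ID to a set of user IDs, the function name `find_large_intersections` and the `threshold` parameter are hypothetical, and matching pairs are collected in a list instead of being passed to `largeIntersection`. It also skips groups with a smaller or equal ID before counting rather than after, which should report the same pairs.

```python
from collections import defaultdict

def find_large_intersections(groups, threshold=15):
    """Return (group_id, other_id) pairs, group_id < other_id, that share
    at least `threshold` users. `groups` maps group ID -> set of user IDs
    (an assumed input layout, not something fixed by the question)."""
    # Invert the input: user ID -> list of IDs of the groups the user is in.
    user_groups = defaultdict(list)
    for gid, members in groups.items():
        for uid in members:
            user_groups[uid].append(gid)

    pairs = []
    for gid, members in groups.items():
        counts = defaultdict(int)  # other group ID -> shared users seen so far
        for uid in members:
            for other in user_groups[uid]:
                if other <= gid:
                    continue  # skip the group itself; report each pair once
                counts[other] += 1
                if counts[other] == threshold:
                    pairs.append((gid, other))
    return pairs

# Tiny example with a lowered threshold: groups 1 and 2 share users 11 and 12.
example = {1: {10, 11, 12}, 2: {11, 12, 13}, 3: {12, 14}}
print(find_large_intersections(example, threshold=2))  # [(1, 2)]
```

Testing `counts[other] == threshold` (rather than `>=`) mirrors the `== 15` check in the pseudocode, so each pair is reported exactly once, at the moment it reaches the threshold.
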
Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O(G*U*g), which, given your problem, is probably much faster than the naive pairwise comparison of groups, which runs in O(G*G*U).
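
To get a feel for the gap, here is a rough estimate with purely illustrative numbers. The values G = 10^7, U = 10 and g = 10 are assumptions chosen for the arithmetic, not figures taken from your data:

$$
\begin{aligned}
G \cdot U \cdot g &= 10^{7} \cdot 10 \cdot 10 = 10^{9} \\
G \cdot G \cdot U &= 10^{7} \cdot 10^{7} \cdot 10 = 10^{15} \\
\frac{G \cdot G \cdot U}{G \cdot U \cdot g} &= \frac{G}{g} = 10^{6}
\end{aligned}
$$

In general the ratio between the two bounds is G/g, so the approach pays off whenever a typical user belongs to far fewer groups than there are groups in total.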