Quickest algorithm for finding sets with high intersection

后端 未结 4 503
予麋鹿
予麋鹿 2021-02-01 06:37

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million

4条回答
  •  长情又很酷
    2021-02-01 07:19

    One way is to see your problem as a metric space radius search problem where the distance function is the number of not matching entries and the radius is r = max(number of elements in sets) - number of equal. Filtering of the found elements is needed to see that enough values are in the Set. So unless someone comes up with a metric function that can be used directly this solutions has to much constraints.

    One of the data structures for metric search is the BK-Tree that can be used for string similarity searches.

    Candidates for your problem are VP-tree and M-Trees.

    The worst case for metric tree is O(n^2) when you're searching for a distance > m (maximum number of elements in the sets) as you build up the tree in O(log n * n) and search in O(n^2).

    Apart for that the actual runtime complexity depends on the abbility to prune subtrees of the metric tree while executing the search. In a metric tree a subtree can be skipped if the distance of the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance of a ancestores to the pivot element). If your entry sets are rather disjunct the overall runtime will be dominated by the build-up time of the metric tree O(log n * n).

提交回复
热议问题