Quickest algorithm for finding sets with high intersection

后端未结

关注

 4  503

予麋鹿 2021-02-01 06:37

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million

4条回答

长情又很酷 (楼主)

2021-02-01 07:19

One way is to see your problem as a metric space radius search problem where the distance function is the number of not matching entries and the radius is r = max(number of elements in sets) - number of equal. Filtering of the found elements is needed to see that enough values are in the Set. So unless someone comes up with a metric function that can be used directly this solutions has to much constraints.

One of the data structures for metric search is the BK-Tree that can be used for string similarity searches.

Candidates for your problem are VP-tree and M-Trees.

The worst case for metric tree is O(n^2) when you're searching for a distance > m (maximum number of elements in the sets) as you build up the tree in O(log n * n) and search in O(n^2).

Apart for that the actual runtime complexity depends on the abbility to prune subtrees of the metric tree while executing the search. In a metric tree a subtree can be skipped if the distance of the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance of a ancestores to the pivot element). If your entry sets are rather disjunct the overall runtime will be dominated by the build-up time of the metric tree O(log n * n).

0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...