Data structure for matching sets

前端 未结 13 1144
有刺的猬
有刺的猬 2021-02-02 00:14

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.

I then have several data items:
1 {1,

13条回答
  •  失恋的感觉
    2021-02-02 00:39

    I can't prove it, but I'm fairly certain that there is no solution that can easily beat the O(n) bound. Your problem is "too general": every set has m = 50 properties (namely, property k is that it contains the number k) and the point is that all these properties are independent of each other. There aren't any clever combinations of properties that can predict the presence of other properties. Sorting doesn't work because the problem is very symmetric, any permutation of your 50 numbers will give the same problem but screw up any kind of ordering. Unless your input has a hidden structure, you're out of luck.

    However, there is some room for speed / memory tradeoffs. Namely, you can precompute the answers for small queries. Let Q be a query set, and supersets(Q) be the collection of sets that contain Q, i.e. the solution to your problem. Then, your problem has the following key property

    Q ⊆ P  =>  supersets(Q) ⊇ supersets(P)
    

    In other words, the results for P = {1,3,4} are a subcollection of the results for Q = {1,3}.

    Now, precompute all answers for small queries. For demonstration, let's take all queries of size <= 3. You'll get a table

    supersets({1})
    supersets({2})
    ...
    supersets({50})
    supersets({1,2})
    supersets({2,3})
    ...
    supersets({1,2,3})
    supersets({1,2,4})
    ...
    
    supersets({48,49,50})
    

    with O(m^3) entries. To compute, say, supersets({1,2,3,4}), you look up superset({1,2,3}) and run your linear algorithm on this collection. The point is that on average, superset({1,2,3}) will not contain the full n = 50,000 elements, but only a fraction n/2^3 = 6250 of those, giving an 8-fold increase in speed.

    (This is a generalization of the "reverse index" method that other answers suggested.)

    Depending on your data set, memory use will be rather terrible, though. But you might be able to omit some rows or speed up the algorithm by noting that a query like {1,2,3,4} can be calculated from several different precomputed answers, like supersets({1,2,3}) and supersets({1,2,4}), and you'll use the smallest of these.

提交回复
热议问题