Data structure for matching sets

有刺的猬 2021-02-02 00:14

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
The numbers in a set are unique and all less than 50.

I then have several data items:
1 {1,

13 answers
  • 2021-02-02 00:37

    This is not a real answer, more an observation: this problem looks like it could be efficiently parallelized or even distributed, which would at least reduce the running time to O(n / number of cores).

  • 2021-02-02 00:38

    If you're going to improve performance, you're going to have to do something fancy to reduce the number of set comparisons you make.

    Maybe you can partition the data items so that you have all those where 1 is the smallest element in one group, and all those where 2 is the smallest item in another group, and so on.

    When it comes to searching, you find the smallest value in the search set and look only at the groups whose smallest element is at most that value, since a data item that contains the search set cannot have a larger minimum.

    Or, perhaps, group them into 50 groups by 'this data item contains N' for N = 1..50.

    When it comes to searching, you find the size of each group that holds each element of the set, and then search just the smallest group.

    The concern with this - especially the latter - is that the overhead of reducing the search time might outweigh the performance benefit from the reduced search space.
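    A minimal sketch of the second variant (50 groups by "this data item contains N", then scanning only the smallest relevant group). The names `GroupedItems`, `find_supersets`, and `kMaxValue` are hypothetical, and the sketch assumes all values lie in 1..50 and that the goal is to find the data items containing every element of the search set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int kMaxValue = 50;  // assumption: every value lies in 1..50

struct GroupedItems {
    std::vector<std::vector<int>> items;           // each item, sorted ascending
    std::vector<std::vector<std::size_t>> groups;  // groups[v]: indices of items containing v

    explicit GroupedItems(std::vector<std::vector<int>> data)
        : items(std::move(data)), groups(kMaxValue + 1) {
        for (std::size_t i = 0; i < items.size(); ++i) {
            std::sort(items[i].begin(), items[i].end());
            for (int v : items[i]) {
                assert(v >= 1 && v <= kMaxValue);
                groups[v].push_back(i);
            }
        }
    }

    // Returns the indices of all items that contain every element of `query`.
    std::vector<std::size_t> find_supersets(std::vector<int> query) const {
        std::sort(query.begin(), query.end());

        // Pick the smallest group among the groups of the query's elements.
        const std::vector<std::size_t>* smallest = nullptr;
        for (int v : query) {
            assert(v >= 1 && v <= kMaxValue);
            if (smallest == nullptr || groups[v].size() < smallest->size()) {
                smallest = &groups[v];
            }
        }

        std::vector<std::size_t> result;
        if (smallest == nullptr) {  // empty query: every item is a superset
            for (std::size_t i = 0; i < items.size(); ++i) result.push_back(i);
            return result;
        }

        // Full subset test only on the candidates in the smallest group.
        for (std::size_t i : *smallest) {
            if (std::includes(items[i].begin(), items[i].end(),
                              query.begin(), query.end())) {
                result.push_back(i);
            }
        }
        return result;
    }
};
```

    With the five example data items from this thread (stored at indices 0 to 4), a query for {4, 7, 12, 18} only scans the three items in the group for 12 and returns indices 0, 2, and 3.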

  • 2021-02-02 00:39

    I can't prove it, but I'm fairly certain that there is no solution that can easily beat the O(n) bound. Your problem is "too general": every set has m = 50 properties (namely, property k is that it contains the number k), and the point is that all these properties are independent of each other. There aren't any clever combinations of properties that can predict the presence of other properties. Sorting doesn't work because the problem is very symmetric: any permutation of your 50 numbers will give the same problem but screw up any kind of ordering. Unless your input has a hidden structure, you're out of luck.

    However, there is some room for speed / memory tradeoffs. Namely, you can precompute the answers for small queries. Let Q be a query set, and supersets(Q) be the collection of sets that contain Q, i.e. the solution to your problem. Then, your problem has the following key property:

    Q ⊆ P  =>  supersets(Q) ⊇ supersets(P)
    

    In other words, the results for P = {1,3,4} are a subcollection of the results for Q = {1,3}.

    Now, precompute all answers for small queries. For demonstration, let's take all queries of size <= 3. You'll get a table

    supersets({1})
    supersets({2})
    ...
    supersets({50})
    supersets({1,2})
    supersets({1,3})
    ...
    supersets({1,2,3})
    supersets({1,2,4})
    ...
    supersets({48,49,50})
    

    with O(m^3) entries. To compute, say, supersets({1,2,3,4}), you look up supersets({1,2,3}) and run your linear algorithm on this collection. The point is that, on average, supersets({1,2,3}) will not contain the full n = 50,000 elements but only a fraction of them. If each number appeared in a given set independently with probability 1/2, that fraction would be n/2^3 = 6250, giving an 8-fold increase in speed.

    (This is a generalization of the "reverse index" method that other answers suggested.)

    Depending on your data set, memory use will be rather terrible, though. But you might be able to omit some rows or speed up the algorithm by noting that a query like {1,2,3,4} can be calculated from several different precomputed answers, like supersets({1,2,3}) and supersets({1,2,4}), and you'll use the smallest of these.
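    A rough sketch of this precomputation, limited to queries of size 2 to keep the table small (the answer above uses size <= 3). Storing each data item as a 64-bit mask is an assumption of the sketch, not something the question specifies, and `PairTable`, `to_mask`, and `find_supersets` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxValue = 50;  // assumption: every value lies in 1..50
using Mask = std::uint64_t;    // bit v is set when the set contains v

Mask to_mask(const std::vector<int>& values) {
    Mask m = 0;
    for (int v : values) {
        assert(v >= 1 && v <= kMaxValue);
        m |= Mask{1} << v;
    }
    return m;
}

struct PairTable {
    std::vector<Mask> items;
    // table[a][b] for a < b: indices of items containing both a and b,
    // i.e. the precomputed supersets({a, b}).
    std::vector<std::vector<std::vector<std::size_t>>> table;

    explicit PairTable(const std::vector<std::vector<int>>& data)
        : table(kMaxValue + 1,
                std::vector<std::vector<std::size_t>>(kMaxValue + 1)) {
        for (const auto& d : data) items.push_back(to_mask(d));
        for (std::size_t i = 0; i < items.size(); ++i) {
            for (int a = 1; a <= kMaxValue; ++a) {
                if (!((items[i] >> a) & 1)) continue;
                for (int b = a + 1; b <= kMaxValue; ++b) {
                    if ((items[i] >> b) & 1) table[a][b].push_back(i);
                }
            }
        }
    }

    std::vector<std::size_t> find_supersets(const std::vector<int>& query) const {
        const Mask q = to_mask(query);

        // Among all precomputed pairs inside the query, use the smallest list.
        const std::vector<std::size_t>* best = nullptr;
        for (std::size_t i = 0; i < query.size(); ++i) {
            for (std::size_t j = i + 1; j < query.size(); ++j) {
                const int a = std::min(query[i], query[j]);
                const int b = std::max(query[i], query[j]);
                if (a == b) continue;
                const auto* candidate = &table[a][b];
                if (best == nullptr || candidate->size() < best->size()) best = candidate;
            }
        }

        std::vector<std::size_t> result;
        if (best != nullptr) {
            // Linear filter, but only over the precomputed candidates.
            for (std::size_t i : *best) {
                if ((items[i] & q) == q) result.push_back(i);
            }
        } else {
            // Fewer than two distinct values: fall back to the full linear scan.
            for (std::size_t i = 0; i < items.size(); ++i) {
                if ((items[i] & q) == q) result.push_back(i);
            }
        }
        return result;
    }
};
```

    The table here has O(m^2) lists rather than O(m^3); extending it to triples follows the same pattern at a correspondingly higher memory cost.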

  • 2021-02-02 00:45

    You could use an inverted index of your data items. For your example

    1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
    2 {3, 4, 6, 7, 15, 23, 34, 38}
    3 {4, 7, 12, 18}
    4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
    5 {2, 4, 6, 7, 13, 15}
    

    the inverted index will be

    1: {1, 4}
    2: {1, 5}
    3: {2}
    4: {1, 2, 3, 4, 5}
    5: {}
    6: {2, 5}
    ...
    

    So, for any particular set {x_0, x_1, ..., x_i} you need to intersect the index sets for x_0, x_1 and the others. For example, for the set {2,3,4} you need to intersect {1,5} with {2} and with {1,2,3,4,5}. Because all the sets in the inverted index can be kept sorted, an intersection can be done in time roughly proportional to the shorter of the two sets (by searching the longer one for each element of the shorter one; a plain merge is linear in the sum of the lengths).

    One potential issue: very 'popular' items (such as 4 in our example) have huge index sets.

    Some words about intersecting. You could use sorted lists in the inverted index and intersect them in pairs (in increasing length order). Or, as you have no more than 50K items, you could use compressed bit sets (about 6 KB for every number uncompressed, less for sparse bit sets; with about 50 numbers that is not much memory) and intersect the bit sets bitwise. For sparse bit sets that should be efficient, I think.
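    The sorted-list variant could look roughly like the sketch below. It uses `std::set_intersection` (a plain merge) and processes the lists shortest first; `build_inverted_index` and `find_supersets` are hypothetical names, ids are 1-based to match the example above, and the query is assumed to be non-empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

constexpr int kMaxValue = 50;  // assumption: every value lies in 1..50
using Postings = std::vector<std::size_t>;  // sorted ids of data items

// index[v] lists the ids of the data items that contain v.
std::vector<Postings> build_inverted_index(const std::vector<std::vector<int>>& items) {
    std::vector<Postings> index(kMaxValue + 1);
    for (std::size_t i = 0; i < items.size(); ++i) {
        for (int v : items[i]) {
            assert(v >= 1 && v <= kMaxValue);
            // Ids are appended in increasing order, so each list stays sorted
            // (assuming an item never repeats a value).
            index[v].push_back(i + 1);
        }
    }
    return index;
}

// Returns the ids of all data items that contain every element of `query`.
Postings find_supersets(const std::vector<Postings>& index, const std::vector<int>& query) {
    assert(!query.empty());  // this sketch does not handle the empty query

    std::vector<const Postings*> lists;
    for (int v : query) {
        assert(v >= 1 && v <= kMaxValue);
        lists.push_back(&index[v]);
    }
    // Intersect in increasing length order so the running result shrinks early.
    std::sort(lists.begin(), lists.end(),
              [](const Postings* a, const Postings* b) { return a->size() < b->size(); });

    Postings result = *lists.front();
    for (std::size_t i = 1; i < lists.size() && !result.empty(); ++i) {
        Postings next;
        std::set_intersection(result.begin(), result.end(),
                              lists[i]->begin(), lists[i]->end(),
                              std::back_inserter(next));
        result = std::move(next);
    }
    return result;
}
```

    For the example data, the query {2,3,4} starts from the list for 3, which is {2}, intersects it with {1,5}, and stops with an empty result without ever touching the long list for 4.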

  • 2021-02-02 00:46

    Another idea is to completely prehunt your elephants.

    Setup

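    Create a bit array of 64 rows by 50,000 columns: one row per possible value (only 50 of them are needed) and one bit per data item.

    Go through your data items and, for each one, set its bit in the row of every value it contains.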

    Save the bit map to disk, so it can be reloaded as needed.

    Searching

    Load the element bit array into memory.

    Create a bit map array, 1 X 50,000. Set all of the values to 1. This is the search bit array.

    Take your needle and walk through each value. Use it as a subscript into the element bit array. Take the corresponding row, then AND it into the search bit array.

    Do that for all values in your needle, and your search bit array will hold a 1 for every matching element.

    Reconstruct

    Walk through the search bit array, and for each 1 you can use the element bit array to reconstruct the original values.
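    The three phases above can be sketched as follows. This reads the element bit array as one row of bits per value, with one bit per data item; `BitMatrix`, `find_supersets`, and `values_of` are hypothetical names, and saving the bit map to disk is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxValue = 50;  // assumption: every value lies in 1..50

struct BitMatrix {
    std::size_t item_count;
    std::size_t words;  // number of 64-bit words per row
    std::vector<std::vector<std::uint64_t>> rows;  // rows[v]: bit i set if item i contains v

    // Setup: set the bit of each data item in the row of every value it contains.
    explicit BitMatrix(const std::vector<std::vector<int>>& items)
        : item_count(items.size()),
          words((items.size() + 63) / 64),
          rows(kMaxValue + 1, std::vector<std::uint64_t>(words, 0)) {
        for (std::size_t i = 0; i < items.size(); ++i) {
            for (int v : items[i]) {
                assert(v >= 1 && v <= kMaxValue);
                rows[v][i / 64] |= std::uint64_t{1} << (i % 64);
            }
        }
    }

    // Searching: start with all ones, AND in the row of every needle value.
    std::vector<std::size_t> find_supersets(const std::vector<int>& needle) const {
        std::vector<std::uint64_t> search(words, ~std::uint64_t{0});
        for (int v : needle) {
            assert(v >= 1 && v <= kMaxValue);
            for (std::size_t w = 0; w < words; ++w) search[w] &= rows[v][w];
        }
        std::vector<std::size_t> matches;
        // Looping up to item_count skips the padding bits in the last word.
        for (std::size_t i = 0; i < item_count; ++i) {
            if ((search[i / 64] >> (i % 64)) & 1) matches.push_back(i);
        }
        return matches;
    }

    // Reconstruct: read the column of item i across all rows.
    std::vector<int> values_of(std::size_t i) const {
        assert(i < item_count);
        std::vector<int> values;
        for (int v = 1; v <= kMaxValue; ++v) {
            if ((rows[v][i / 64] >> (i % 64)) & 1) values.push_back(v);
        }
        return values;
    }
};
```

    Each needle value costs one pass of word-wise ANDs over a row, which for 50,000 items is about 782 64-bit operations.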

  • 2021-02-02 00:47

    I'm surprised no one has mentioned that the STL contains an algorithm to handle this sort of thing for you: std::includes. As its documentation describes, it performs at most 2*(N+M)-1 comparisons, for a worst-case performance of O(M+N). Both ranges must be sorted.

    Hence:

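    // Requires <algorithm>; true if every element of another is in myVector (both sorted ascending).
    bool isContained = std::includes(myVector.begin(), myVector.end(), another.begin(), another.end());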
    

    If you need O(log N) time, I'll have to yield to the other responders.
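    For completeness, a small self-contained sketch that applies `std::includes` to every data item in turn. The helper name `find_supersets` is hypothetical, and the sketch assumes both the data items and the query are already sorted ascending.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices of all items that contain every element of `query`.
// Precondition: each item and the query are sorted in ascending order.
std::vector<std::size_t> find_supersets(const std::vector<std::vector<int>>& items,
                                        const std::vector<int>& query) {
    assert(std::is_sorted(query.begin(), query.end()));
    std::vector<std::size_t> matches;
    for (std::size_t i = 0; i < items.size(); ++i) {
        const std::vector<int>& item = items[i];
        assert(std::is_sorted(item.begin(), item.end()));
        // The first range is the candidate superset, the second the subset.
        if (std::includes(item.begin(), item.end(), query.begin(), query.end())) {
            matches.push_back(i);
        }
    }
    return matches;
}
```

    This is the plain linear scan over all data items, so it is the baseline that the indexing ideas in the other answers try to improve on.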
