Data structure for matching sets

有刺的猬 2021-02-02 00:14

I have an application that works with a number of sets. A set might be
{4, 7, 12, 18}
The numbers in a set are unique and all less than 50.

I then have several data items:
1 {1,

13 answers
  •  孤街浪徒
    2021-02-02 00:45

    You could use an inverted index of your data items. For your example

    1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
    2 {3, 4, 6, 7, 15, 23, 34, 38}
    3 {4, 7, 12, 18}
    4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
    5 {2, 4, 6, 7, 13, 15}
    

    the inverted index will be

    1: {1, 4}
    2: {1, 5}
    3: {2}
    4: {1, 2, 3, 4, 5}
    5: {}
    6: {2, 5}
    ...
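
    As a rough sketch of how such an index could be built (the names `items` and `build_inverted_index` are made up for illustration, and the data is just the five example items above):

```python
from collections import defaultdict

# Data items from the example above: item id -> set of numbers.
items = {
    1: {1, 2, 4, 7, 8, 12, 18, 23, 29},
    2: {3, 4, 6, 7, 15, 23, 34, 38},
    3: {4, 7, 12, 18},
    4: {1, 4, 7, 12, 13, 14, 15, 16, 17, 18},
    5: {2, 4, 6, 7, 13, 15},
}


def build_inverted_index(items):
    """Map each number to the sorted list of item ids that contain it."""
    index = defaultdict(list)
    # Visiting item ids in ascending order keeps every list sorted.
    for item_id in sorted(items):
        for number in items[item_id]:
            index[number].append(item_id)
    return dict(index)


index = build_inverted_index(items)
print(index[4])          # ids of the items containing 4
print(index.get(5, []))  # 5 appears in no item, so it has no entry
```

    Numbers that occur in no item simply have no entry, so lookups use `index.get(number, [])`.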
    

    So, for any particular set {x_0, x_1, ..., x_i} you need to intersect the index sets for x_0, x_1 and so on. For example, for the set {2, 3, 4} you need to intersect {1, 5} with {2} and with {1, 2, 3, 4, 5}. Because all the sets in the inverted index can be kept sorted, each intersection can be done in time roughly proportional to the length of the shorter of the two sets being intersected.
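
    A minimal sketch of that lookup, assuming the goal is to find the items that contain every number of the query set. The function name `find_items_containing` and the hand-written `index` excerpt are illustrative only, and the sketch uses Python's built-in sets for the intersection instead of merging sorted lists:

```python
# Excerpt of the inverted index for the five example items:
# number -> sorted list of item ids that contain it.
index = {
    2: [1, 5],
    3: [2],
    4: [1, 2, 3, 4, 5],
    7: [1, 2, 3, 4, 5],
    12: [1, 3, 4],
    18: [1, 3, 4],
}


def find_items_containing(index, query):
    """Return the sorted ids of the items that contain every number in query."""
    postings = [index.get(number, []) for number in query]
    if not postings:
        return []
    # Start with the shortest list so the working set stays small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        if not result:
            break  # already empty, no need to look at longer lists
        result.intersection_update(posting)
    return sorted(result)


print(find_items_containing(index, {2, 3, 4}))       # []
print(find_items_containing(index, {4, 7, 12, 18}))  # [1, 3, 4]
```

    Sorting the lists by length first means a rare number prunes the candidates early, before a 'popular' one is touched.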

    This could become an issue if you have very 'popular' numbers (such as 4 in our example) with a huge index set.

    A few words about intersecting. You could use sorted lists in the inverted index and intersect them in pairs, in order of increasing length. Or, as you have no more than 50K items, you could use compressed bit sets (about 6 KB for every number, less for sparse bit sets; with only about 50 numbers that is not too greedy) and intersect them bitwise. For sparse bit sets that will be efficient, I think.
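
    To illustrate the bitwise variant, here is a small sketch that uses plain Python integers as bit sets. This is an assumption made for brevity: the bit sets here are uncompressed, and the helper names `to_bitset`, `from_bitset` and `matching_items` are hypothetical. Bit i is set when item i contains the number.

```python
from functools import reduce
from operator import and_


def to_bitset(item_ids):
    """Pack item ids into an integer: bit i is set if item i is present."""
    bits = 0
    for item_id in item_ids:
        bits |= 1 << item_id
    return bits


def from_bitset(bits):
    """Unpack an integer bit set back into a sorted list of item ids."""
    item_ids = []
    position = 0
    while bits:
        if bits & 1:
            item_ids.append(position)
        bits >>= 1
        position += 1
    return item_ids


# Excerpt of the inverted index for the example, one bit set per number.
bit_index = {
    2: to_bitset([1, 5]),
    3: to_bitset([2]),
    4: to_bitset([1, 2, 3, 4, 5]),
    12: to_bitset([1, 3, 4]),
}


def matching_items(bit_index, query):
    """AND together the bit sets of all query numbers (missing number -> 0)."""
    masks = [bit_index.get(number, 0) for number in query]
    if not masks:
        return []
    return from_bitset(reduce(and_, masks))


print(matching_items(bit_index, [4, 12]))    # [1, 3, 4]
print(matching_items(bit_index, [2, 3, 4]))  # []
```

    With real data, a compressed bit set implementation would take the place of the plain integers, but the AND step stays the same.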
