Data structure for matching sets

有刺的猬 2021-02-02 00:14

I have an application where I have a number of sets. A set might be {4, 7, 12, 18}: the numbers are unique and all less than 50.

I then have several data items:
1 {1,

13 Answers
  • 2021-02-02 00:25

    Put your sets into an array (not a linked list) and SORT THEM. The sorting criteria can be either 1) the number of elements in the set (number of 1-bits in the set representation), or 2) the lowest element in the set. For example, let A={7, 10, 16} and B={11, 17}. Then B<A under criterion 1), and A<B under criterion 2). Sorting is O(n log n), but I assume that you can afford some preprocessing time, i.e., that the search structure is static.

    When a new data item arrives, you can use binary search (logarithmic time) to find the starting candidate set in the array. Then you search linearly through the array and test the data item against the set in the array until the data item becomes "greater" than the set.

    You should choose your sorting criterion based on the spread of your sets. If all sets have 0 as their lowest element, you shouldn't choose criterion 2). Vice-versa, if the distribution of set cardinalities is not uniform, you shouldn't choose criterion 1).

    Yet another, more robust sorting criterion would be to compute the span of elements in each set and sort by that. For example, the lowest element in set A is 7 and the highest is 16, so you would encode its span as 0x1007 (highest element in the high byte, lowest in the low byte); similarly, B's span would be 0x110B. Sort the sets according to the "span code" and again use binary search to find all sets with the same "span code" as your data item.

    Computing the "span code" is slow in ordinary C, but it can be done fast if you resort to assembly -- most CPUs have instructions that find the most/least significant set bit.
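
    As a rough illustration of the "span code" idea, here is a minimal sketch that assumes each set is already stored as a 64-bit bitmask and that the compiler is GCC or Clang (the `__builtin_ctzll` and `__builtin_clzll` intrinsics are compiler-specific; C++20 offers `std::countr_zero` and `std::bit_width` as portable alternatives). The names `span_code` and `make_mask` are hypothetical helpers, not something from the answer above.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper: builds the bitmask of a set of small integers (each < 64).
inline std::uint64_t make_mask(std::initializer_list<int> elems) {
    std::uint64_t mask = 0;
    for (int e : elems) {
        assert(0 <= e && e < 64);
        mask |= std::uint64_t{1} << e;
    }
    return mask;
}

// Hypothetical helper: packs the highest and lowest element of a non-empty
// set into one 16-bit code, (highest << 8) | lowest.
inline std::uint16_t span_code(std::uint64_t mask) {
    assert(mask != 0 && "the span of an empty set is undefined");
    // GCC/Clang intrinsics: count trailing / leading zero bits.
    const unsigned lowest  = static_cast<unsigned>(__builtin_ctzll(mask));
    const unsigned highest = 63u - static_cast<unsigned>(__builtin_clzll(mask));
    return static_cast<std::uint16_t>((highest << 8) | lowest);
}
```

    With the example sets from this answer, `span_code(make_mask({7, 10, 16}))` gives 0x1007 and `span_code(make_mask({11, 17}))` gives 0x110B.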

  • 2021-02-02 00:31

    You can build a reverse index of "haystack" lists that contain each element:

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <unordered_map>
    #include <vector>

    int main() {
      std::set<int> needle = {4, 7, 12, 18};
      std::vector<std::set<int>> haystacks;
      // A list of each of your data sets (fill this in; numbered from 1 here):
      // 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
      // 2 {3, 4, 6, 7, 15, 23, 34, 38}
      // 3 {4, 7, 12, 18}
      // 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
      // 5 {2, 4, 6, 7, 13, 

      // element_haystacks maps each integer to the sets that contain it
      // (the keys are the integers from the haystack sets, and
      // the values are sets of 1-based positions in the 'haystacks' vector):
      // 1 -> {1, 4}  Element 1 is in sets 1 and 4.
      // 2 -> {1, 5}  Element 2 is in sets 1 and 5.
      // 3 -> {2}  Element 3 is in set 2.
      // 4 -> {1, 2, 3, 4, 5}  Element 4 is in sets 1 through 5.
      std::unordered_map<int, std::set<int>> element_haystacks;
      for (std::size_t i = 0; i < haystacks.size(); ++i) {
        for (int element : haystacks[i]) {
          element_haystacks[element].insert(static_cast<int>(i) + 1);
        }
      }

      // The haystack sets that contain your set. Start from the list for the
      // first needle element: intersecting with an empty set would always
      // give an empty answer.
      std::set<int> answer_sets;
      if (!needle.empty()) answer_sets = element_haystacks[*needle.begin()];
      for (std::set<int>::const_iterator it = needle.begin(); it != needle.end(); ++it) {
        const std::set<int> &new_answer = element_haystacks[*it];
        std::set<int> existing_answer;
        std::swap(existing_answer, answer_sets);
        // Remove all answers that don't occur in the new element list.
        std::set_intersection(existing_answer.begin(), existing_answer.end(),
                              new_answer.begin(), new_answer.end(),
                              std::inserter(answer_sets, answer_sets.begin()));
        if (answer_sets.empty()) break;  // No matches :(
      }

      // answer_sets now lists the haystack ids that include all your needle elements.
      for (int id : answer_sets) {
        std::cout << "set: " << id << "\n";
      }
      return 0;
    }
    

    If I'm not mistaken, this will have a max runtime of O(k*m), where k is the average number of sets that an integer belongs to and m is the average size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).

    I'm sure you could improve this a bit by storing sorted vectors instead of sets, and by making element_haystacks a 50-element vector instead of a hash map.
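
    One way the sorted-vector variant could look is sketched below. It assumes every element lies in [0, 50), that each haystack lists an element at most once, and that haystack ids are 0-based positions in the input vector; `ReverseIndex`, `build_index`, and `find_supersets` are hypothetical names chosen for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

constexpr int kMaxElement = 50;  // assumption: all elements are in [0, 50)

// rev[e] is the ascending list of haystack ids that contain element e.
using ReverseIndex = std::array<std::vector<int>, kMaxElement>;

ReverseIndex build_index(const std::vector<std::vector<int>>& haystacks) {
    ReverseIndex rev;
    for (std::size_t id = 0; id < haystacks.size(); ++id) {
        for (int e : haystacks[id]) {
            assert(0 <= e && e < kMaxElement);
            // Ids are visited in increasing order, so each list stays sorted.
            rev[static_cast<std::size_t>(e)].push_back(static_cast<int>(id));
        }
    }
    return rev;
}

// Returns the ids of all haystacks containing every element of `needle`.
// Simplification: an empty needle returns no ids.
std::vector<int> find_supersets(const ReverseIndex& rev,
                                const std::vector<int>& needle) {
    if (needle.empty()) return {};
    for (int e : needle) assert(0 <= e && e < kMaxElement);
    std::vector<int> answer = rev[static_cast<std::size_t>(needle.front())];
    std::vector<int> next;
    for (std::size_t i = 1; i < needle.size() && !answer.empty(); ++i) {
        const std::vector<int>& ids = rev[static_cast<std::size_t>(needle[i])];
        next.clear();
        // Keep only the ids that also contain needle[i].
        std::set_intersection(answer.begin(), answer.end(),
                              ids.begin(), ids.end(),
                              std::back_inserter(next));
        answer.swap(next);
    }
    return answer;
}
```

    Starting from the shortest of the needle's lists instead of the first would probably shrink the intermediate results further, at the cost of one extra pass over the needle.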

  • 2021-02-02 00:31

    How many data items do you have? Are they really all unique? Could you cache popular data items, or use a bucket/radix sort before the run to group repeated items together?

    Here is an indexing approach:

    1) Divide the 50-bit field into e.g. 10 5-bit sub-fields. If you really have 50K sets then 3 17-bit chunks might be nearer the mark.

    2) For each set, choose a single subfield. A good choice is the sub-field where that set has the most bits set, with ties broken almost arbitrarily - e.g. use the leftmost such sub-field.

    3) For each possible bit-pattern in each sub-field note down the list of sets which are allocated to that sub-field and match that pattern, considering only the sub-field.

    4) Given a new data item, divide it into its 5-bit chunks and look each up in its own lookup table to get a list of sets to test against. If your data is completely random you get a factor of two speedup or more, depending on how many bits are set in the densest sub-field of each set. If an adversary gets to make up random data for you, perhaps they find data items that almost but not quite match loads of sets and you don't do very well at all.
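
    Steps 1) to 4) above could be sketched roughly as follows, assuming sets and data items are 50-bit masks held in a `uint64_t`, ten 5-bit sub-fields, ties broken by taking the lowest-numbered sub-field, and "match" meaning that a set is contained in the data item. All names (`SubfieldIndex`, `build_subfield_index`, `query_subsets`, `mask_of`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

constexpr int kFieldBits = 5;
constexpr int kFields = 10;                       // 10 x 5 = 50 bits
constexpr unsigned kPatterns = 1u << kFieldBits;  // 32 patterns per sub-field

struct SubfieldIndex {
    std::vector<std::uint64_t> sets;
    // tables[f][p]: ids of sets allocated to sub-field f whose bits inside f
    // are all present in pattern p.
    std::array<std::array<std::vector<int>, kPatterns>, kFields> tables;
};

inline std::uint64_t mask_of(std::initializer_list<int> elems) {
    std::uint64_t m = 0;
    for (int e : elems) { assert(0 <= e && e < 50); m |= std::uint64_t{1} << e; }
    return m;
}

inline unsigned subfield(std::uint64_t mask, int f) {
    return static_cast<unsigned>(mask >> (f * kFieldBits)) & (kPatterns - 1);
}

inline int bit_count(unsigned v) {
    int n = 0;
    for (; v != 0; v &= v - 1) ++n;  // clears the lowest set bit each round
    return n;
}

SubfieldIndex build_subfield_index(const std::vector<std::uint64_t>& sets) {
    SubfieldIndex idx;
    idx.sets = sets;
    for (std::size_t id = 0; id < sets.size(); ++id) {
        int best = 0;  // step 2: the sub-field where this set has the most bits
        for (int f = 1; f < kFields; ++f)
            if (bit_count(subfield(sets[id], f)) > bit_count(subfield(sets[id], best)))
                best = f;
        const unsigned bits = subfield(sets[id], best);
        for (unsigned p = 0; p < kPatterns; ++p)  // step 3: every pattern covering `bits`
            if ((p & bits) == bits)
                idx.tables[static_cast<std::size_t>(best)][p].push_back(static_cast<int>(id));
    }
    return idx;
}

// Step 4: look up each chunk of the data item, then run the full test
// on the candidates only.
std::vector<int> query_subsets(const SubfieldIndex& idx, std::uint64_t data) {
    std::vector<int> matches;
    for (int f = 0; f < kFields; ++f)
        for (int id : idx.tables[static_cast<std::size_t>(f)][subfield(data, f)]) {
            const std::uint64_t set = idx.sets[static_cast<std::size_t>(id)];
            if ((set & data) == set) matches.push_back(id);
        }
    return matches;
}
```

    Each set lives in exactly one sub-field's table, so no id is reported twice; the result is ordered by sub-field rather than by id.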

    Possibly there is scope for taking advantage of any structure in your sets, by numbering bits so that sets tend to have two or more bits in their best sub-field - e.g. do cluster analysis on the bits, treating them as similar if they tend to appear together in sets. Or if you can predict patterns in the data items, alter the allocation of sets to sub-fields in step(2) to reduce the number of expected false matches.

    Addition: How many tables would you need to guarantee that any 2 bits always fall into the same table? If you look at the combinatorial definition in http://en.wikipedia.org/wiki/Projective_plane, you can see that there is a way to extract collections of 7 bits from 57 (= 1 + 7 + 49) bits in 57 different ways so that for any two bits at least one collection contains both of them. Probably not very useful, but it's still an answer.

  • 2021-02-02 00:35

    A possible way to divvy up the list of bitmaps would be to create an array of Compiled Nibble Indicators (CNIs).

    Let's say one of your 64-bit bitmaps has bit 0 to bit 4 set.
    In hex we can look at it as 0x000000000000001F

    Now, let's transform that into a simpler and smaller representation. Each 4 bit Nibble, either has at least one bit set, or not. If it does, we represent it as a 1, if not we represent it as a 0.

    So the hex value reduces to the bit pattern 0000000000000011, as the right-hand 2 nibbles are the only ones that have bits in them. Create an array that holds 65536 values, and use them as heads of linked lists, or a set of large arrays....

    Compile each of your bitmaps into its compact CNI. Add it to the correct list, until all of the lists have been compiled.

    Then take your needle. Compile it into its CNI form. Use that value to subscript to the head of the list. All bitmaps in that list have a possibility of being a match. All bitmaps in the other lists cannot match.

    That is a way to divvy them up.

    Now in practice, I doubt a linked list would meet your performance requirements.

    If you write a function to compile a bit map to CNI, you could use it as a basis to sort your array by the CNI. Then have your array of 65536 heads, simply subscript into the original array as the start of a range.

    Another technique would be to just compile a part of the 64 bit bitmap, so you have fewer heads. Analysis of your patterns should give you an idea of what nibbles are most effective in partitioning them up.
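
    A function that compiles a bitmap to its CNI might look like the sketch below; `compile_cni` and `cni_may_match` are hypothetical names. The second helper is an assumption on my part: if matching means "the stored set is contained in the data item" (the `(set & data) == set` test used elsewhere in this thread), then lists whose CNI is only a subset of the data item's CNI can also hold matches, so a per-list pre-filter is shown for that case.

```cpp
#include <cassert>
#include <cstdint>

// Bit i of the result is 1 if and only if nibble i (bits 4*i .. 4*i+3)
// of `bitmap` has at least one bit set.
inline std::uint16_t compile_cni(std::uint64_t bitmap) {
    std::uint16_t cni = 0;
    for (int nibble = 0; nibble < 16; ++nibble) {
        if ((bitmap >> (4 * nibble)) & 0xFu) {
            cni = static_cast<std::uint16_t>(cni | (1u << nibble));
        }
    }
    return cni;
}

// Pre-filter for subset matching: a set can only be contained in `data`
// if every non-empty nibble of the set is also non-empty in `data`.
inline bool cni_may_match(std::uint16_t set_cni, std::uint16_t data_cni) {
    return (set_cni & data_cni) == set_cni;
}
```

    For the example bitmap 0x000000000000001F this yields the CNI 0x0003. Sorting the array of bitmaps by `compile_cni` and recording where each CNI value starts would give the 65536 range heads described above.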

    Good luck to you, and please let us know what you finally end up doing.

    Evil.

  • 2021-02-02 00:35

    Since the numbers are less than 50, you could build a one-to-one hash using a 64-bit integer and then use bitwise operations to test the sets in O(1) time. The hash creation would also be O(1). I think an AND followed by a test for equality, (set & data) == set, would work for checking that a set is contained in a data item; an XOR followed by a test for zero would only detect an exact match. (If I understood the problem correctly.)
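
    A minimal sketch of that bitmask test, assuming every element is below 64 so that a set fits into one `uint64_t`; `to_mask` and `is_subset` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Maps a set of small integers to a bitmask: element e becomes bit e.
inline std::uint64_t to_mask(std::initializer_list<int> elems) {
    std::uint64_t mask = 0;
    for (int e : elems) {
        assert(0 <= e && e < 64);
        mask |= std::uint64_t{1} << e;
    }
    return mask;
}

// True if every element of `set` also occurs in `data`.
inline bool is_subset(std::uint64_t set, std::uint64_t data) {
    return (set & data) == set;
}
```

    For example, `is_subset(to_mask({4, 7, 12, 18}), to_mask({1, 2, 4, 7, 8, 12, 18, 23, 29}))` is true, while the same set tested against `to_mask({3, 4, 6, 7, 15, 23, 34, 38})` is false because 12 and 18 are missing.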

  • 2021-02-02 00:37

    The index of the sets that match the search criterion resembles the sets themselves. Instead of having unique indexes less than 50, we have unique indexes less than 50000. Since you don't mind using a bit of memory, you can precompute matching sets in a 50-element array of 50000-bit integers. Then you index into the precomputed matches and basically just do your ((set & data) == set), but on the 50000-bit numbers which represent the matching sets. Here's what I mean.

    #include <cstdint>   // uint64_t
    #include <cstring>   // memset
    #include <iostream>
    
    enum
    {
        max_sets = 50000, // should be >= 64
        num_boxes = max_sets / 64 + 1,
        max_entry = 50
    };
    
    uint64_t sets_containing[max_entry][num_boxes];
    
    #define _(x) (uint64_t(1) << (x))
    
    uint64_t sets[] =
    {
        _(1) | _(2) | _(4) | _(7) | _(8) | _(12) | _(18) | _(23) | _(29),
        _(3) | _(4) | _(6) | _(7) | _(15) | _(23) | _(34) | _(38),
        _(4) | _(7) | _(12) | _(18),
        _(1) | _(4) | _(7) | _(12) | _(13) | _(14) | _(15) | _(16) | _(17) | _(18),
        _(2) | _(4) | _(6) | _(7) | _(13) | _(15),
        0,
    };
    
    void big_and_equals(uint64_t lhs[num_boxes], uint64_t rhs[num_boxes])
    {
        static int comparison_counter = 0;
        for (int i = 0; i < num_boxes; ++i, ++comparison_counter)
        {
            lhs[i] &= rhs[i];
        }
        std::cout
            << "performed "
            << comparison_counter
            << " comparisons"
            << std::endl;
    }
    
    int main()
    {
        // Precompute matches
        memset(sets_containing, 0, sizeof(uint64_t) * max_entry * num_boxes);
    
        int set_number = 0;
        for (uint64_t* p = &sets[0]; *p; ++p, ++set_number)
        {
            int entry = 0;
            for (uint64_t set = *p; set; set >>= 1, ++entry)
            {
                if (set & 1)
                {
                    std::cout
                        << "sets_containing["
                        << entry
                        << "]["
                        << (set_number / 64)
                        << "] gets bit "
                        << set_number % 64
                        << std::endl;
    
                    uint64_t& flag_location =
                        sets_containing[entry][set_number / 64];
    
                    flag_location |= _(set_number % 64);
                }
            }
        }
    
        // Perform search for a key
        int key[] = {4, 7, 12, 18};
        uint64_t answer[num_boxes];
        memset(answer, 0xff, sizeof(uint64_t) * num_boxes);
    
        for (std::size_t i = 0; i < sizeof(key) / sizeof(key[0]); ++i)
        {
            big_and_equals(answer, sets_containing[key[i]]);
        }
    
        // Display the matches
        for (int set_number = 0; set_number < max_sets; ++set_number)
        {
            if (answer[set_number / 64] & _(set_number % 64))
            {
                std::cout
                    << "set "
                    << set_number
                    << " matches"
                    << std::endl;
            }
        }
    
        return 0;
    }
    

    Running this program yields:

    sets_containing[1][0] gets bit 0
    sets_containing[2][0] gets bit 0
    sets_containing[4][0] gets bit 0
    sets_containing[7][0] gets bit 0
    sets_containing[8][0] gets bit 0
    sets_containing[12][0] gets bit 0
    sets_containing[18][0] gets bit 0
    sets_containing[23][0] gets bit 0
    sets_containing[29][0] gets bit 0
    sets_containing[3][0] gets bit 1
    sets_containing[4][0] gets bit 1
    sets_containing[6][0] gets bit 1
    sets_containing[7][0] gets bit 1
    sets_containing[15][0] gets bit 1
    sets_containing[23][0] gets bit 1
    sets_containing[34][0] gets bit 1
    sets_containing[38][0] gets bit 1
    sets_containing[4][0] gets bit 2
    sets_containing[7][0] gets bit 2
    sets_containing[12][0] gets bit 2
    sets_containing[18][0] gets bit 2
    sets_containing[1][0] gets bit 3
    sets_containing[4][0] gets bit 3
    sets_containing[7][0] gets bit 3
    sets_containing[12][0] gets bit 3
    sets_containing[13][0] gets bit 3
    sets_containing[14][0] gets bit 3
    sets_containing[15][0] gets bit 3
    sets_containing[16][0] gets bit 3
    sets_containing[17][0] gets bit 3
    sets_containing[18][0] gets bit 3
    sets_containing[2][0] gets bit 4
    sets_containing[4][0] gets bit 4
    sets_containing[6][0] gets bit 4
    sets_containing[7][0] gets bit 4
    sets_containing[13][0] gets bit 4
    sets_containing[15][0] gets bit 4
    performed 782 comparisons
    performed 1564 comparisons
    performed 2346 comparisons
    performed 3128 comparisons
    set 0 matches
    set 2 matches
    set 3 matches
    

    3128 uint64_t comparisons beats 50000 comparisons so you win. Even in the worst case, which would be a key which has all 50 items, you only have to do num_boxes * max_entry comparisons which in this case is 39100. Still better than 50000.
