Need algorithm for fast storage and retrieval (search) of sets and subsets

遇见更好的自我 2021-02-06 11:57

I need a way of storing sets of arbitrary size so that they can be queried quickly later on. I'll need to query the resulting data structure for subsets or for sets that are already stored.

5 answers
  • 2021-02-06 12:17

    If I understand your needs correctly, you need a data structure that stores multiple states and supports retrieval on combinations of those states.

    If the states are binary (as in your examples: has milk/doesn't have milk, has sugar/doesn't have sugar) or can be converted to binary (possibly by adding more states), then there is a very fast technique for your purpose: bitmap indices.

    Bitmap indices can evaluate such comparisons in memory, and very little competes with them on speed, since ANDing bits is among the fastest operations a computer performs.
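
    To make the idea concrete, here is a minimal in-memory sketch. It keeps one bit column per attribute (one bit per stored row) and answers "which rows have all of these attributes?" by ANDing the columns. The class name `BitmapIndex` and its member functions are made up for illustration, attributes are assumed to be pre-numbered from 0, and there is no compression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: one bitmap (column) per attribute, one bit per row.
class BitmapIndex {
public:
    explicit BitmapIndex(std::size_t attribute_count) : columns_(attribute_count) {}

    // Store a set of attribute numbers as a new row and return its row id.
    std::size_t add(const std::vector<std::size_t>& attributes) {
        const std::size_t row = row_count_++;
        // Grow every column so that it has a bit for the new row.
        for (auto& column : columns_) column.resize(row / 64 + 1, 0);
        for (std::size_t a : attributes) {
            assert(a < columns_.size());
            columns_[a][row / 64] |= std::uint64_t{1} << (row % 64);
        }
        return row;
    }

    // Row ids of all stored sets that contain every attribute in `query`.
    std::vector<std::size_t> rows_with_all(const std::vector<std::size_t>& query) const {
        const std::size_t words = (row_count_ + 63) / 64;
        std::vector<std::uint64_t> acc(words, ~std::uint64_t{0});
        for (std::size_t a : query) {
            assert(a < columns_.size());
            // The actual work: AND the attribute's column into the accumulator.
            for (std::size_t w = 0; w < words; ++w) acc[w] &= columns_[a][w];
        }
        std::vector<std::size_t> rows;
        for (std::size_t row = 0; row < row_count_; ++row) {
            if ((acc[row / 64] >> (row % 64)) & 1) rows.push_back(row);
        }
        return rows;
    }

private:
    std::vector<std::vector<std::uint64_t>> columns_;
    std::size_t row_count_ = 0;
};
```

    With attributes numbered 0 = milk, 1 = sugar, 2 = cream (an assumed numbering), adding the rows {0,1}, {1} and {0,1,2} and then asking for `rows_with_all({0, 1})` yields rows 0 and 2.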

    http://en.wikipedia.org/wiki/Bitmap_index

    Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086

    Many SQL databases support bitmap indexing or closely related techniques, and there are several possible optimizations for it as well (compression, for example):

    MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx

    Oracle: http://www.orafaq.com/wiki/Bitmap_index

    Edit: Apparently the original research work on bitmap indices is no longer available for free public access.
    Links to recent literature on this subject:

    • Bitmap Index Design Choices and Their Performance Implications
    • Bitmap Index Design and Evaluation
    • Compressing Bitmap Indexes for Faster Search Operations
  • 2021-02-06 12:21

    This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
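
    As a rough illustration of one direction of that correspondence (a naive sketch, not one of the algorithms from the papers): if every stored set is written as a string of '0'/'1' flags over a fixed attribute universe, then a subset query becomes a pattern with '1' at the required attributes and '?' everywhere else. The function names below are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if `word` matches `pattern`, where '?' matches any single character.
bool partial_match(const std::string& pattern, const std::string& word) {
    if (pattern.size() != word.size()) return false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] != '?' && pattern[i] != word[i]) return false;
    }
    return true;
}

// Encode "the set must contain all of `required`" as a partial-match pattern
// over `universe` attributes: '1' where required, '?' where it does not matter.
std::string subset_query_pattern(const std::vector<std::size_t>& required,
                                 std::size_t universe) {
    std::string pattern(universe, '?');
    for (std::size_t a : required) pattern.at(a) = '1';  // throws if out of range
    return pattern;
}
```

    For example, requiring attributes 0 and 2 out of 4 gives the pattern `1?1?`, which matches the stored set `1011` but not `1001`.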

    One of the earliest results in this area is from this paper by Ron Rivest from 1976 [1]. This [2] is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.

    1. Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.

    2. Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.

  • 2021-02-06 12:26

    This seems like a custom-made problem for a graph database. You make a node for each set or subset and a node for each element of a set, and then you link the nodes with a Contains relationship.

    Now put all the elements A, B, C, D, E in an index or hash table so that you can find any element's node in the graph in constant time. The typical cost of a query such as {A,B,C} is the order (number of links) of the lowest-order element node, multiplied by the size of a typical set. For example, to find {A,B,C}, I see that the order of A is one, so I look at every set that A is in, which is just S1, and then check that it also contains B and C. Since the order of S1 is 4, that takes at most 4 comparisons in total.
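
    The lookup strategy can be sketched without a graph database, using plain in-memory maps as a stand-in for the nodes and Contains links. Everything here (`SetGraph`, `add_set`, `sets_containing`, the example contents of S1) is hypothetical and only meant to show the "start from the lowest-order element" idea.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the graph: set nodes, element nodes, Contains links.
struct SetGraph {
    std::map<std::string, std::set<std::string>> members;                 // set -> its elements
    std::unordered_map<std::string, std::vector<std::string>> contained;  // element -> sets containing it

    // Assumes each set name is added only once.
    void add_set(const std::string& name, const std::set<std::string>& elements) {
        members[name] = elements;
        for (const auto& e : elements) contained[e].push_back(name);
    }

    // Names of all stored sets that contain every element of `query`.
    std::vector<std::string> sets_containing(const std::set<std::string>& query) const {
        // Pick the query element that is linked to the fewest sets.
        const std::vector<std::string>* smallest = nullptr;
        for (const auto& e : query) {
            const auto it = contained.find(e);
            if (it == contained.end()) return {};  // unknown element: nothing can match
            if (!smallest || it->second.size() < smallest->size()) smallest = &it->second;
        }
        std::vector<std::string> result;
        if (!smallest) return result;  // empty query: this sketch returns nothing
        // Verify only the candidate sets linked to that element.
        for (const auto& name : *smallest) {
            const auto& elems = members.at(name);
            if (std::includes(elems.begin(), elems.end(), query.begin(), query.end())) {
                result.push_back(name);
            }
        }
        return result;
    }
};
```

    With S1 = {A,B,C,D} as the only set containing A, a query for {A,B,C} inspects just S1, which matches the cost estimate above.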

    A prebuilt graph database like Neo4j comes with a query language and should give good performance. Provided that the typical node orders in your database are not large, I would imagine that its performance is far superior to the algorithms based on set representations.

    
  • 2021-02-06 12:28

    I'm confident that I can now contribute to the solution. One possible quite efficient way is a:

    Trie, as used by Franklin Mark Liang

    Such a tree is used, for example, in spell checking or autocompletion, which comes close to your desired behavior, in particular because it allows searching for subsets quite conveniently.

    The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.

    What is a Set-Trie? It is a tree in which each node except the root holds a single attribute value (a number) and a marker (a bool) that says whether a data entry ends at this node. Each subtree contains only attributes whose values are larger than the attribute value of its parent node. The root of the Set-Trie is empty. A search key corresponds to a path from the root to some node of the tree. The search result is the set of paths from the root to all marked nodes that you reach when you walk down the tree and along the search key simultaneously (see below).

    But first a drawing by me:

    Simple Set-Trie drawing

    The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.

    Please note that the right subtree from the root does not contain attribute 1 at all. That's the key point.

    Searching including subsets: Say you want to search for attributes 4 and 1. First you order them, so the search key is {1,4}. Now, starting from the root, you move simultaneously along the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. From there you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than or equal to 4, which here is all of them. You continue until there is nothing left to do and collect all circles (data entries) whose path has matched every attribute of the key, ending with 4 (the last attribute in the key). Any circles further below such a node would match as well. The results are {1,2,4} and {1,4}, but not {1,3} (no 4) or {2,4} (no 1).

    Insertion: This is very easy. Go down the tree and store the data entry at the appropriate position. For example, the data entry {2,5} would be stored as a new child node 5 under node 2.

    Adding attributes dynamically: This works out of the box. You could immediately insert {1,4,6}, and it would go below {1,4}.
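
    Here is a small sketch of how insertion and the search described above could look in code. It is my own simplified reading of the structure, not the implementation from Savnik's paper; the names `SetTrie`, `insert` and `supersets_of` are made up, and attributes are assumed to be plain ints.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <vector>

// Hypothetical Set-Trie sketch: children are ordered by attribute value.
class SetTrie {
public:
    // Walk down along the sorted attributes, creating nodes as needed.
    void insert(const std::set<int>& entry) {
        Node* node = &root_;
        for (int attr : entry) {  // std::set iterates in ascending order
            auto& child = node->children[attr];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->marked = true;  // a data entry ends here
    }

    // All stored entries that contain every attribute of `key`.
    std::vector<std::set<int>> supersets_of(const std::set<int>& key) const {
        const std::vector<int> sorted_key(key.begin(), key.end());
        std::vector<std::set<int>> found;
        std::vector<int> path;
        collect(root_, sorted_key, 0, path, found);
        return found;
    }

private:
    struct Node {
        bool marked = false;
        std::map<int, std::unique_ptr<Node>> children;
    };

    // `pos` is the index of the next key attribute that still has to be matched.
    static void collect(const Node& node, const std::vector<int>& key, std::size_t pos,
                        std::vector<int>& path, std::vector<std::set<int>>& found) {
        if (pos == key.size() && node.marked) found.emplace_back(path.begin(), path.end());
        for (const auto& entry : node.children) {
            const int attr = entry.first;
            // Children larger than the pending key attribute can never contain it.
            if (pos < key.size() && attr > key[pos]) break;
            const std::size_t next = (pos < key.size() && attr == key[pos]) ? pos + 1 : pos;
            path.push_back(attr);
            collect(*entry.second, key, next, path, found);
            path.pop_back();
        }
    }

    Node root_;
};
```

    With the data from the drawing, `supersets_of({1, 4})` returns {1,2,4} and {1,4}. After inserting {1,4,6} it returns that entry too, because everything below a fully matched node is also collected.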

    I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.

    I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.

  • 2021-02-06 12:34

    How about an inverted index built from hashes?

    Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.

    Then you define a map from each hash value to a vector of pointers to the tuples that contain a value with that hash:

    std::map<size_t,std::vector<TupleStruct*> > mymap;
    

    or, if you can use global indices, just

    std::map<size_t,std::vector<size_t> > mymap;
    

    For retrieval by queries X and Y, you need to

    1. get hash value of the queries Xh and Yh
    2. get the corresponding "sets" out of mymap
    3. intersect the sets mymap[Xh] and mymap[Yh]
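
    The three steps above could look roughly like this. This is only a sketch under some assumptions: all values are simplified to `std::string` (instead of mixed int/char/bool), tuples are referred to by global indices, and the class name `HashInvertedIndex` is made up. Since different values can share a hash, the intersection only yields candidates, so the sketch re-checks them against the stored tuples.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of an inverted index keyed by hash values.
class HashInvertedIndex {
public:
    // Store a tuple and return its global index.
    std::size_t add(const std::vector<std::string>& tuple) {
        const std::size_t id = tuples_.size();
        tuples_.push_back(tuple);
        for (const auto& value : tuple) {
            auto& ids = mymap_[std::hash<std::string>{}(value)];
            // Ids arrive in increasing order, so each list stays sorted.
            if (ids.empty() || ids.back() != id) ids.push_back(id);
        }
        return id;
    }

    // Indices of all tuples that contain both x and y.
    std::vector<std::size_t> query(const std::string& x, const std::string& y) const {
        std::vector<std::size_t> result;
        // Steps 1 and 2: hash the queries and fetch their index lists.
        const auto xs = mymap_.find(std::hash<std::string>{}(x));
        const auto ys = mymap_.find(std::hash<std::string>{}(y));
        if (xs == mymap_.end() || ys == mymap_.end()) return result;
        // Step 3: intersect the two sorted lists.
        std::vector<std::size_t> candidates;
        std::set_intersection(xs->second.begin(), xs->second.end(),
                              ys->second.begin(), ys->second.end(),
                              std::back_inserter(candidates));
        // Extra step: guard against hash collisions by checking the real values.
        for (std::size_t id : candidates) {
            if (contains(tuples_[id], x) && contains(tuples_[id], y)) result.push_back(id);
        }
        return result;
    }

private:
    static bool contains(const std::vector<std::string>& tuple, const std::string& value) {
        return std::find(tuple.begin(), tuple.end(), value) != tuple.end();
    }

    std::vector<std::vector<std::string>> tuples_;
    std::map<std::size_t, std::vector<std::size_t>> mymap_;
};
```

    Keeping each list sorted is what makes `std::set_intersection` applicable. For more than two query values, the same intersection can be repeated, ideally starting with the shortest lists.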