I need a way of storing sets of arbitrary size for fast query later on. I\'ll be needing to query the resulting data structure for subsets or sets that are already stored.
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: Has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary(by possibly adding more states) then you have a lightning speed algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory and there literally is nothing in comparison on speed with these (ANDing bits is what computers can really do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases supoort Bitmap Indexing and there are several possible optimizations for it as well(by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is from this paper by Ron Rivest from 19761. This2 is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A,B,C,D,E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g. to find {A,B,C] I find the order of A is one, so I look at all the sets A is in, S1, and then I check that it has all of BC, since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.