Fastest way to perform subset test operation on a large collection of sets with same domain


Assume we have trillions of sets stored somewhere. The domain for each of these sets is the same. It is also finite and discrete. So each set may be stored as a bit field (eg: 0

6 Answers

猫巷女王i

2021-02-10 02:53

A quick glance makes me think of BDDs (binary decision diagrams), which is somewhat along the lines of the DAG solution. Alternatively, a ZDD (zero-suppressed decision diagram).

滥情空心

2021-02-10 02:53

If you can preprocess the sets, the subset relation is representable as a DAG (because you're describing a poset). If the transitive reduction is computed, then I think you can avoid testing all the sets by just performing a DFS starting from the biggest sets and stopping whenever Y is no longer a subset of the current set being visited.
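
The pruned traversal described above can be sketched as follows. This is only an illustration under stated assumptions: the sets are distinct, each is stored as a Python `int` bit field, and the collection is small enough that a naive cubic-time transitive reduction is acceptable. The names `build_hasse` and `supersets_via_dag` are hypothetical and not from the original answer.

```python
def is_subset(y, x):
    """True if every bit set in y is also set in x."""
    return (y & x) == y


def build_hasse(sets):
    """Naive transitive reduction of the subset poset (assumes distinct sets).

    Returns (children, roots): children[a] lists the sets directly covered
    by sets[a]; roots are the maximal sets, which have no proper superset.
    """
    n = len(sets)
    children = {i: [] for i in range(n)}
    has_parent = [False] * n
    for a in range(n):
        for b in range(n):
            if a == b or not is_subset(sets[b], sets[a]):
                continue
            # Keep the edge a -> b only if no third set lies strictly between.
            direct = not any(
                c != a and c != b
                and is_subset(sets[b], sets[c])
                and is_subset(sets[c], sets[a])
                for c in range(n)
            )
            if direct:
                children[a].append(b)
                has_parent[b] = True
    roots = [i for i in range(n) if not has_parent[i]]
    return children, roots


def supersets_via_dag(y, sets, children, roots):
    """DFS from the maximal sets, pruning below any set that does not contain y."""
    found, seen = [], set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if is_subset(y, sets[node]):
            found.append(node)
            # Only subsets of a superset of y can themselves contain y.
            stack.extend(children[node])
    return sorted(found)


sets = [0b1111, 0b0111, 0b0011, 0b1100, 0b0001]
children, roots = build_hasse(sets)
print(supersets_via_dag(0b0011, sets, children, roots))
```

The pruning is safe because if Y is not a subset of X, it cannot be a subset of anything below X. How much work this saves depends on the shape of the poset, and building the reduction for trillions of sets would need a far more scalable construction than the one shown here.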

不思量自难忘°

2021-02-10 02:56

Depending on the cardinality of the set from which all the sets are drawn, one option might be to build an inverted index mapping from elements to the sets that contain them. Given a set Y, you could then find all sets that have Y as a subset by finding all of the sets that contain each element individually and computing their intersection. If you store the lists in sorted order (for example, by numbering all the sets in your database with values 0, 1, etc.) then you should be able to compute this intersection fairly efficiently, assuming that no one element is contained in too many sets.
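
A minimal sketch of this inverted-index idea, assuming the sets fit in memory and are numbered 0, 1, ... in the order given. The function names (`build_inverted_index`, `intersect_sorted`, `supersets_via_index`) are illustrative only and not from the original answer.

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Map each element to the sorted list of IDs of the sets containing it."""
    index = defaultdict(list)
    for set_id, members in enumerate(sets):
        for element in members:
            # IDs arrive in increasing order, so each posting list stays sorted.
            index[element].append(set_id)
    return index


def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of set IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def supersets_via_index(y, index, num_sets):
    """Return the IDs of all sets that contain every element of y."""
    if not y:
        return list(range(num_sets))  # the empty set is a subset of everything
    # Start with the shortest posting list so intermediate results stay small.
    postings = sorted((index.get(e, []) for e in y), key=len)
    result = postings[0]
    for plist in postings[1:]:
        if not result:
            break
        result = intersect_sorted(result, plist)
    return list(result)


sets = [{1, 2, 3}, {2, 3}, {3, 4}, {1, 2, 3, 4}]
index = build_inverted_index(sets)
print(supersets_via_index({2, 3}, index, len(sets)))
```

Intersecting the shortest lists first is a common heuristic rather than something the answer prescribes. At the scale in the question the posting lists would presumably live on disk or in a distributed store instead of a Python dictionary.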

猫巷女王i

2021-02-10 03:06

I tend to say that the answer is no, because of the bit field's very low cardinality.

Happy的楠姐

2021-02-10 03:08

If an RDBMS were your only option, I would recommend looking at this interesting article on modelling a DAG in SQL:

http://www.codeproject.com/KB/database/Modeling_DAGs_on_SQL_DBs.aspx?msg=3051183

If you can't afford Oracle or MSSQL, have a look at PostgreSQL 9, which supports recursive queries. It has also supported cross joins for quite some time.

执念已碎

2021-02-10 03:09

Given your volume, this would be a stretch for a conventional RDBMS. Have you looked at Neo4j, which is based on a graph storage model?
