Question
I would like to debate whether PlayORM's virtual partitioning is always the best way to partition data, compared with Cassandra's built-in partitioning.
Schema:
- TimeStamp
- Device ID
- Device Name
- Device Owner
For a given TimeStamp there are 500K rows, and for a particular Device ID there are 10K rows.
Suppose I want to partition on two columns, say TimeStamp and Device ID. I see the following ways to do this:
- Use PlayORM to 'virtually' partition on both columns, so that the data of any virtual partition, on either column, is distributed across all nodes.
- Use Cassandra's built-in partitioning support for one of the columns, and use PlayORM's approach to create 'virtual' partitions on the other column.
If 'Device ID' were partitioned the Cassandra way, all the records for a particular Device ID would be stored contiguously on disk, and I could still use the virtual partitioning approach for 'TimeStamp', as PlayORM does. The reason I might prefer this over PlayORM's approach is that, with Cassandra's partitioning, all records for a particular Device ID can be fetched quickly when they sit in physically contiguous locations on disk, since there are relatively few of them (only 10K). This may be better than PlayORM's all-out approach of distributing the records of every partition evenly across nodes, because the data would then be randomly distributed on disk, resulting in many disk seeks that would slow reads down. So even though PlayORM's approach is a divide-and-conquer solution that splits the rows among the nodes in the cluster, the speedup may be offset by the extra disk seeks, because rows could be scattered all over each node (as opposed to a Cassandra partition, where they would all be together).
Does the above seem to be a valid point, or is there some fault in my understanding?
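To make the two layouts concrete, here is a toy model of where the 10K rows of one device would land under each scheme. It is only a sketch of the reasoning above: `node_for`, the 6-node cluster size, and the key formats are all made up for illustration, and it does not use or reproduce PlayORM's or Cassandra's actual placement logic.

```python
import hashlib
from collections import Counter

NODES = 6  # hypothetical cluster size, chosen only for illustration


def node_for(key: str, nodes: int = NODES) -> int:
    """Map a key to a node with a stable hash (a stand-in for a real partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def placement_by_device(device_id: str, row_count: int) -> Counter:
    """Device ID is the partition key: every row of the device maps to the same node."""
    return Counter(node_for(device_id) for _ in range(row_count))


def placement_by_row_key(device_id: str, row_count: int) -> Counter:
    """Each row has its own key: the rows of one device scatter across the nodes."""
    return Counter(node_for(f"{device_id}:{i}") for i in range(row_count))


if __name__ == "__main__":
    together = placement_by_device("device-42", 10_000)
    scattered = placement_by_row_key("device-42", 10_000)
    print("nodes holding rows (device ID as partition key):", len(together))
    print("nodes holding rows (one key per row):", len(scattered))
    print("rows per node when scattered:", dict(sorted(scattered.items())))
```

The model only counts nodes touched; it says nothing about seek cost on each node, which is the part of my question I cannot reason about on paper.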
Answer 1:
That could potentially be true, but you are also assuming that reads on a single Cassandra node will not involve many seeks either, and all the compactions that can occur may undermine that. Compactions are constantly occurring in Cassandra with either SizeTiered or Leveled compaction. The best thing may be to just write an actual test case that tests both scenarios. Sometimes taking a couple of days to really test out theories can pay off big in the end. To really test this well, you may want a 6-node cluster if reads are set to QUORUM (i.e., 2 nodes hit for each read). If you have 3 nodes with RF=3, you may see equal performance.
Anyway, there is no substitute for testing. We found that many things that were "said" turned out to be wrong once we tested them, so it is always better to run the code and see how it works out for your case.
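The cluster-size arithmetic behind that suggestion can be sketched as follows. This is a back-of-the-envelope illustration, not a Cassandra API: the function names are made up, and it relies only on the usual quorum definition of `floor(RF / 2) + 1` replicas.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must answer a QUORUM read: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def fraction_of_cluster_per_read(nodes: int, replication_factor: int) -> float:
    """Share of the cluster's nodes that a single QUORUM read touches."""
    return quorum(replication_factor) / nodes


def fraction_of_data_per_node(nodes: int, replication_factor: int) -> float:
    """Share of all rows each node stores, assuming evenly spread replicas."""
    return min(1.0, replication_factor / nodes)


if __name__ == "__main__":
    for nodes in (3, 6):
        print(
            f"{nodes} nodes, RF=3:",
            f"quorum={quorum(3)},",
            f"cluster share per read={fraction_of_cluster_per_read(nodes, 3):.2f},",
            f"data share per node={fraction_of_data_per_node(nodes, 3):.2f}",
        )
```

With 3 nodes and RF=3, every node holds every row, so the two layouts have little room to behave differently; with 6 nodes each read touches only a third of the cluster, which is where a difference would be more likely to show.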
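A minimal sketch of what such a test harness could look like is below. Everything in it is hypothetical: the two `read_device_*` functions are placeholders that do no real I/O, and you would replace their bodies with the actual queries against each layout (for example, through PlayORM or a Cassandra client) before the numbers mean anything.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_reads(read_fn: Callable[[], object], repeats: int = 20, warmup: int = 3) -> Dict[str, float]:
    """Call read_fn repeatedly and report wall-clock latency statistics in seconds."""
    for _ in range(warmup):
        read_fn()  # warm caches and connections so the first call does not skew results
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


def read_device_cassandra_partition() -> int:
    """Placeholder: replace with a real read of one device's rows from a Cassandra partition."""
    return sum(range(10_000))


def read_device_virtual_partition() -> int:
    """Placeholder: replace with a real read of the same rows through virtual partitions."""
    return sum(range(10_000))


if __name__ == "__main__":
    scenarios = {
        "cassandra partition": read_device_cassandra_partition,
        "virtual partition": read_device_virtual_partition,
    }
    for name, fn in scenarios.items():
        print(name, time_reads(fn))
```

Run it against a cluster loaded with realistic data volumes and while compactions are active, since that is the condition the comparison depends on.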
Dean
Source: https://stackoverflow.com/questions/15578699/can-playorm-take-advantage-of-sequential-data-layout