Can PlayORM take advantage of sequential data layout?

Submitted by 时光毁灭记忆、已成空白 on 2019-12-13 03:59:35

Question


I would like to debate whether PlayORM's virtual partitioning is always the best way to partition data, compared with Cassandra's native partitioning.

Schema:

  • TimeStamp
  • Device ID
  • Device Name
  • Device Owner

There are 500K rows for a given TimeStamp and 10K rows for a particular Device ID.

Suppose I want to partition on two columns, say TimeStamp and Device ID. I see the following ways to do this:

  1. Use PlayORM to 'virtual' partition on both columns, so that the data for any virtual partition on either column is distributed across all nodes.
  2. Use Cassandra's built-in partitioning support for one of the columns, and use PlayORM's approach to create 'virtual' partitions on the other column.

If 'Device ID' were partitioned the Cassandra way, all the records for a particular Device ID would be stored contiguously on disk, and one could still use PlayORM's virtual partitioning for 'TimeStamp'. The reason I might prefer this over PlayORM's approach is that, with Cassandra's partitioning, all records for a particular Device ID can be fetched quickly when they sit in physically contiguous locations on disk, since there are relatively few of them (only 10K).

This may be better than PlayORM's all-out approach of distributing the records of every partition evenly across nodes, because the data would then be randomly distributed on disk, resulting in many disk seeks that slow things down. So even though PlayORM's approach is a divide-and-conquer solution that splits the rows among the nodes in the cluster, the speedup from that parallelism may be offset by the cost of the extra seeks, because the rows could be scattered all over each node (as opposed to Cassandra's partitioning, where they would all be together).
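The trade-off described above can be put into rough numbers. The following is a minimal back-of-envelope sketch, not a measurement: the 10K row count comes from the schema above, while the row size, seek time, disk throughput, and node count are assumed values, and the function names are hypothetical. It models the worst case for the scattered layout (one seek per row) and ignores caching, compaction, and how either system actually places rows on disk.

```python
import math


def contiguous_read_ms(rows, row_bytes, seek_ms, mb_per_s):
    """One seek, then a sequential scan of all rows on a single node."""
    transfer_ms = rows * row_bytes / (mb_per_s * 1_000_000) * 1000
    return seek_ms + transfer_ms


def scattered_read_ms(rows, row_bytes, seek_ms, mb_per_s, nodes):
    """Rows split evenly across nodes that read in parallel.

    Worst-case assumption: every row on a node costs its own seek.
    """
    rows_per_node = math.ceil(rows / nodes)
    transfer_ms = rows_per_node * row_bytes / (mb_per_s * 1_000_000) * 1000
    return rows_per_node * seek_ms + transfer_ms


if __name__ == "__main__":
    ROWS = 10_000      # rows per Device ID, from the question
    ROW_BYTES = 200    # assumed row size
    SEEK_MS = 10.0     # assumed spinning-disk seek time
    MB_PER_S = 100.0   # assumed sequential throughput
    NODES = 6          # assumed cluster size

    print("contiguous:", contiguous_read_ms(ROWS, ROW_BYTES, SEEK_MS, MB_PER_S))
    print("scattered :", scattered_read_ms(ROWS, ROW_BYTES, SEEK_MS, MB_PER_S, NODES))
```

Under these assumptions the seek term dominates the scattered case, which is the concern raised above. With a much smaller effective seek cost (SSDs, or rows that happen to be served from cache) the gap narrows, so the model only shows which parameters matter, not which layout wins in practice.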

Does the above seem to be a valid point, or is there some fault in my understanding?


Answer 1:


That could be true, but you are also assuming that a single Cassandra node will not itself incur many seeks as a result of all the compactions that can occur. Compactions are constantly occurring in Cassandra, with either SizeTiered or Leveled compaction. The best thing may be to write an actual test case that exercises both scenarios. Sometimes taking a couple of days to really test out a theory can pay off big in the end. To test this properly, you may want a 6-node cluster if reads are set to QUORUM (i.e., 2 nodes hit for each read). If you have 3 nodes with RF=3, you may see equal performance.

Anyway, there is no substitute for testing. We found that many things that were "said" turned out to be wrong once we tested them, so it is always better to run the code and see how it works out for your case.
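The cluster-size remark can be checked with a little arithmetic. This is a small sketch with hypothetical function names: a Cassandra quorum is floor(RF / 2) + 1 replicas, and with RF replicas on N nodes each row lives on min(RF, N) of them.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read."""
    return replication_factor // 2 + 1


def fraction_of_nodes_holding_row(nodes, replication_factor):
    """Share of the cluster that stores a copy of any given row."""
    return min(replication_factor, nodes) / nodes


if __name__ == "__main__":
    for nodes in (3, 6):
        print(nodes, quorum(3), fraction_of_nodes_holding_row(nodes, 3))
    # prints:
    # 3 2 1.0
    # 6 2 0.5
```

With 3 nodes and RF=3, every node holds every row, so where a partition is placed cannot change which nodes serve it. With 6 nodes, each row lives on only half the cluster, so differences between the two layouts have a chance to show up.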
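A test like that can start from a very small timing harness. The sketch below is only an assumption about how one might structure it: `median_latency_ms` and `compare` are hypothetical helpers, and the two lambdas are placeholders standing in for the real queries against each layout, which depend on your client code and are not shown.

```python
import statistics
import time


def median_latency_ms(read_fn, runs=50, warmup=5):
    """Call read_fn repeatedly and return the median wall-clock time in ms."""
    for _ in range(warmup):
        read_fn()  # warm caches and connections before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        read_fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def compare(layouts, runs=50):
    """Map each layout name to its median read latency in ms."""
    return {name: median_latency_ms(fn, runs=runs) for name, fn in layouts.items()}


if __name__ == "__main__":
    # Placeholders: replace with functions that fetch all rows for one
    # Device ID from each of the two partitioning schemes.
    results = compare({
        "cassandra_partition": lambda: sum(range(1_000)),
        "virtual_partition": lambda: sum(range(2_000)),
    })
    for name, ms in results.items():
        print(f"{name}: {ms:.3f} ms")
```

The median is used rather than the mean so that a few slow reads, for example ones that collide with a compaction, do not dominate the comparison. It is worth running it several times at different hours for the same reason.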

Dean



Source: https://stackoverflow.com/questions/15578699/can-playorm-take-advantage-of-sequential-data-layout
