How to build a B-tree index using Apache Spark?

Submitted by 强颜欢笑 on 2019-12-06 09:17:45

Question


I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The input format is one record per line (separated by '\n'). I also have no idea what the output file format should be; I just want a recommended one.

The regular way of building a B-tree index is shown at https://en.wikipedia.org/wiki/B-tree, but I would now like a distributed, parallel version in Apache Spark.

In addition, the Wikipedia article on B-trees describes a way to build a B-tree that represents a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree). It seems I would have to sort the data in advance, and for a big data set sorting is quite time-consuming and may not even be possible with limited memory. Is the method mentioned above a recommended one?


Answer 1:


Sort the RDD with RDD.sortBy if it's not already sorted. Use RDD.mapPartitions to build an index for each partition. Then build a small top-level index that connects the per-partition indices.
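A minimal sketch of this idea in Scala, not a full implementation: the input path "input.txt" and the choice of keeping each per-partition "index" as a plain sorted array (rather than real B-tree nodes) are assumptions made for illustration. Because sortBy range-partitions the data, the partitions come back in global key order, so a driver-side table of each partition's minimum key can serve as the top-level index.

```scala
import org.apache.spark.sql.SparkSession

object BTreeIndexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("btree-index-sketch")
      .master("local[*]")                 // assumption: local mode, just for the sketch
      .getOrCreate()
    val sc = spark.sparkContext

    // Assumption: "input.txt" holds one number per line, as described in the question.
    val numbers = sc.textFile("input.txt").map(_.trim.toLong)

    // Step 1: global sort. sortBy shuffles the data into range-partitioned,
    // globally ordered partitions.
    val sorted = numbers.sortBy(x => x)

    // Step 2: build a per-partition "leaf" index. Here each leaf is just the
    // sorted array of keys in that partition; a fuller implementation could
    // lay out real B-tree nodes instead.
    val leaves = sorted.mapPartitionsWithIndex { (partId, iter) =>
      Iterator((partId, iter.toArray))
    }.cache()

    // Step 3: build the top-level index on the driver, mapping each partition's
    // minimum key to its partition id. A lookup searches this table to find the
    // right partition, then searches inside that partition's leaf.
    val topLevel: Array[(Long, Int)] = leaves
      .flatMap { case (partId, keys) => keys.headOption.map(min => (min, partId)) }
      .collect()
      .sortBy(_._1)

    // Example lookup: which partition could contain the key 23?
    val key = 23L
    val candidate = topLevel.takeWhile(_._1 <= key).lastOption.map(_._2)
    println(s"key $key should be searched in partition $candidate")

    spark.stop()
  }
}
```

For persistence, one simple option is to save the sorted partitions with saveAsTextFile (one file per partition) and keep the small top-level table separately; the per-partition files then play the role of leaf nodes.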



Source: https://stackoverflow.com/questions/28911117/how-to-build-b-tree-index-using-apache-spark
