Question
I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The input format is one record per line (separated by '\n'). I also have no idea what format the output file should take; I just want a recommended one.
The usual way of building a B-tree index is described at https://en.wikipedia.org/wiki/B-tree, but I would now like a distributed, parallel version in Apache Spark.
In addition, the B-tree Wikipedia article introduces a way to build a B-tree that represents a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree). It seems that I should sort the data in advance, but for a big data set sorting is quite time-consuming and may not even be feasible with limited memory. Is the method mentioned above a recommended one?
Answer 1:
Sort the RDD with RDD.sortBy if it's not already sorted. Use RDD.mapPartitions to build an index for each partition. Then build a top-level index that connects the per-partition indices.
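A minimal sketch of this approach in Scala, assuming the input is a text file with one integer per line. The `IndexNode` case class, the file path, and the choice to keep only (min, max) ranges in the top-level index are illustrative assumptions, not a full B-tree implementation:

```scala
import org.apache.spark.sql.SparkSession

object BTreeIndexSketch {
  // Toy per-partition index node: holds the key range and sorted keys of one
  // partition. A real implementation would build a proper B-tree here.
  case class IndexNode(minKey: Long, maxKey: Long, keys: Array[Long])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("btree-index-sketch").getOrCreate()
    val sc = spark.sparkContext

    // One record per line, e.g. "1", "4", "10", "23", ... (hypothetical path)
    val numbers = sc.textFile("hdfs:///path/to/numbers.txt").map(_.trim.toLong)

    // 1. Global sort. sortBy uses a range partitioner, so every key in
    //    partition i is <= every key in partition i+1.
    val sorted = numbers.sortBy(identity)

    // 2. Build a leaf-level index for each partition with mapPartitionsWithIndex.
    val perPartition = sorted.mapPartitionsWithIndex { (pid, iter) =>
      val keys = iter.toArray
      if (keys.isEmpty) Iterator.empty
      else Iterator((pid, IndexNode(keys.head, keys.last, keys)))
    }

    // 3. Collect only (partition id, min, max) to the driver and build a small
    //    top-level index that routes a lookup key to the right partition.
    val topLevel = perPartition
      .map { case (pid, node) => (pid, node.minKey, node.maxKey) }
      .collect()
      .sortBy(_._2)

    topLevel.foreach { case (pid, lo, hi) =>
      println(s"partition $pid covers [$lo, $hi]")
    }

    spark.stop()
  }
}
```

The key point is that only the top-level routing table (one (min, max) entry per partition) is collected to the driver, so it stays small regardless of data size; each partition's local index can be built independently and written out in parallel, for example as Parquet or as serialized per-partition files.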
Source: https://stackoverflow.com/questions/28911117/how-to-build-b-tree-index-using-apache-spark