Apache Spark: Effectively using mapPartitions in Java

轻奢々 2020-12-30 08:22

In the early-release textbook High Performance Spark, the developers of Spark note that:

    To allow Spark the flexibility to spill some records to disk, it is important to represent your functions inside of mapPartitions in such a way that your functions don't force loading the entire partition in-memory (e.g. implicitly converting to a list).

What is an elegant way to write this kind of iterator-to-iterator transformation in Java, where Iterator has no map method?
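
For context, the kind of code the book warns against buffers the whole partition before returning it. A rough sketch of that anti-pattern (InObj, OutObj, and transformRow are placeholder names, not from the book):

    rdd.mapPartitions((Iterator<InObj> iter) -> {
        List<OutObj> out = new ArrayList<>(); // buffers the entire partition in memory
        while (iter.hasNext()) {
            out.add(transformRow(iter.next()));
        }
        return out.iterator(); // Spark only sees the data after it has all been collected
    });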

1 Answer
  • 2020-12-30 08:44

    One way to avoid forcing the materialization of the entire partition is to convert the Iterator into a Stream, and then use Stream's functional API (e.g. its map function).

    "How to convert an iterator to a stream?" suggests a few good ways to perform this conversion, so taking one of the options suggested there we end up with:

    import java.util.stream.StreamSupport;

    rdd.mapPartitions((Iterator<InObj> iter) -> {
        // expose the Iterator as an Iterable (traversable only once, which is fine here)
        Iterable<InObj> iterable = () -> iter;
        // 'false' requests a sequential (non-parallel) stream
        return StreamSupport.stream(iterable.spliterator(), false)
                .map(s -> transformRow(s)) // or whatever transformation
                .iterator();               // hand Spark back a lazy Iterator
    });
    

    This is an "iterator-to-iterator" transformation, because all of the intermediate APIs used (Iterable, Stream) are lazily evaluated: no element is pulled from the source Iterator until the resulting Iterator is consumed.
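
    To convince yourself of that laziness outside of Spark, here is a minimal standalone sketch (class and variable names are illustrative) showing that the map callback only runs once the resulting Iterator is consumed:

    import java.util.Iterator;
    import java.util.List;
    import java.util.stream.StreamSupport;

    public class LazyStreamDemo {
        public static void main(String[] args) {
            Iterator<String> source = List.of("a", "b", "c").iterator();
            Iterable<String> iterable = () -> source;
            Iterator<String> mapped = StreamSupport.stream(iterable.spliterator(), false)
                    .map(s -> { System.out.println("mapping " + s); return s.toUpperCase(); })
                    .iterator();
            System.out.println("pipeline built, nothing mapped yet");
            System.out.println(mapped.next()); // only now does "mapping a" print, then "A"
        }
    }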

    EDIT: I haven't tested this myself, but the OP commented, and I quote, that "there is no efficiency increase by using a Stream over a list". I don't know why that is, or whether it holds in general, but it seems worth mentioning.
