In the (currently early-release) textbook High Performance Spark, the developers of Spark note that:
"To allow Spark the flexibility to spill some records to disk, it is important to represent your functions inside of mapPartitions in such a way that your functions don't force loading the entire partition in memory (e.g. implicitly converting an iterator to a list)."
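For contrast, here is a rough sketch of the anti-pattern the book warns against (InObj, OutObj and transformRow are placeholder names, consistent with the code further down): collecting the partition into a List forces the whole partition into memory at once:

// requires: import java.util.ArrayList; import java.util.List;
rdd.mapPartitions((Iterator<InObj> iter) -> {
    List<OutObj> out = new ArrayList<>(); // the whole partition gets buffered here
    while (iter.hasNext()) {
        out.add(transformRow(iter.next())); // eager: every row is transformed up front
    }
    return out.iterator(); // Spark sees nothing until the full list exists
});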
One way to prevent forcing the "materialization" of the entire partition is to convert the Iterator into a Stream, and then use Stream's functional API (e.g. the map function).
The question "How to convert an iterator to a stream?" suggests a few good ways to convert an Iterator into a Stream, so taking one of the options suggested there we can end up with:
// requires: import java.util.Iterator; import java.util.stream.StreamSupport;
rdd.mapPartitions((Iterator<InObj> iter) -> {
    // wrap the Iterator in a one-shot Iterable so we can obtain a Spliterator
    Iterable<InObj> iterable = () -> iter;
    return StreamSupport.stream(iterable.spliterator(), false) // sequential, lazy stream
            .map(s -> transformRow(s)) // or whatever transformation
            .iterator(); // hand Spark back an Iterator, never a full collection
});
This should be an "Iterator-to-Iterator" transformation, because all of the intermediate APIs used (Iterable, Stream) are lazily evaluated.
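As a quick, Spark-free sanity check of that laziness claim, here is a small self-contained sketch (the printing inside map is just for demonstration) showing that elements are only transformed as they are pulled from the resulting Iterator:

import java.util.Iterator;
import java.util.List;
import java.util.stream.StreamSupport;

public class LazinessCheck {
    public static void main(String[] args) {
        Iterator<Integer> source = List.of(1, 2, 3, 4, 5).iterator();
        // same trick as above: a one-shot Iterable backed by the live Iterator
        Iterable<Integer> iterable = () -> source;
        Iterator<Integer> result = StreamSupport.stream(iterable.spliterator(), false)
                .map(x -> { System.out.println("transforming " + x); return x * 10; })
                .iterator();
        // nothing has printed yet: the map runs only as elements are consumed
        System.out.println("first: " + result.next()); // "transforming 1", then "first: 10"
    }
}

Running this prints nothing until result.next() is called, at which point only the first element is transformed, which is exactly the behavior an iterator-to-iterator transformation needs.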
EDIT: I haven't tested this myself, but the OP commented, and I quote, that "there is no efficiency increase by using a Stream over a list". I don't know why that is, and I don't know whether it would hold in general, but it is worth mentioning.