In the (currently early-release) textbook High Performance Spark, the developers of Spark note that:
"To allow Spark the flexibility to spill some records to disk, it is important to represent your functions inside of mapPartitions in such a way that your functions don't force loading the entire partition in memory (e.g. implicitly converting an iterator to a list)."
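For contrast, here is a rough sketch of the anti-pattern the book warns against (InObj, OutObj and transformRow are placeholder names, consistent with the code further down): collecting the partition into a List forces the whole partition into memory at once:

// requires: import java.util.ArrayList; import java.util.List;
rdd.mapPartitions((Iterator<InObj> iter) -> {
    List<OutObj> out = new ArrayList<>(); // the whole partition gets buffered here
    while (iter.hasNext()) {
        out.add(transformRow(iter.next())); // eager: every row is transformed up front
    }
    return out.iterator(); // Spark sees nothing until the full list exists
});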
One way to prevent forcing the "materialization" of the entire partition is to convert the Iterator into a Stream, and then use Stream's functional API (e.g. the map function).
The question "How to convert an iterator to a stream?" suggests a few good ways to convert an Iterator into a Stream, so taking one of the options suggested there we can end up with:
// requires: import java.util.Iterator; import java.util.stream.StreamSupport;
rdd.mapPartitions((Iterator<InObj> iter) -> {
    // wrap the Iterator in a one-shot Iterable so we can obtain a Spliterator
    Iterable<InObj> iterable = () -> iter;
    return StreamSupport.stream(iterable.spliterator(), false) // sequential, lazy stream
            .map(s -> transformRow(s)) // or whatever transformation
            .iterator(); // hand Spark back an Iterator, never a full collection
});
This should be an "Iterator-to-Iterator" transformation, because all of the intermediate APIs used (Iterable, Stream) are lazily evaluated.
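As a quick, Spark-free sanity check of that laziness claim, here is a small self-contained sketch (the printing inside map is just for demonstration) showing that elements are only transformed as they are pulled from the resulting Iterator:

import java.util.Iterator;
import java.util.List;
import java.util.stream.StreamSupport;

public class LazinessCheck {
    public static void main(String[] args) {
        Iterator<Integer> source = List.of(1, 2, 3, 4, 5).iterator();
        // same trick as above: a one-shot Iterable backed by the live Iterator
        Iterable<Integer> iterable = () -> source;
        Iterator<Integer> result = StreamSupport.stream(iterable.spliterator(), false)
                .map(x -> { System.out.println("transforming " + x); return x * 10; })
                .iterator();
        // nothing has printed yet: the map runs only as elements are consumed
        System.out.println("first: " + result.next()); // "transforming 1", then "first: 10"
    }
}

Running this prints nothing until result.next() is called, at which point only the first element is transformed, which is exactly the behavior an iterator-to-iterator transformation needs.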
EDIT: I haven't tested this myself, but the OP commented, and I quote, that "there is no efficiency increase by using a Stream over a list". I don't know why that is, and I don't know whether it would hold in general, but it is worth mentioning.