The following can be found on various forums in relation to mapPartitions and map:
First of all, this code is not correct. While it looks like an adaptation of the established pattern for foreachPartition, it cannot be used with mapPartitions like this.
Remember that mapPartitions takes Iterator[_] and returns Iterator[_] (unlike foreachPartition, which returns Unit), and Iterator.map is lazy, so this code closes the connection before it is actually used.
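The laziness problem can be reproduced without Spark. The following is a minimal sketch in plain Scala; FakeConnection and brokenPartitionFunction are hypothetical names introduced only for illustration, standing in for a real connection and for the function passed to mapPartitions.

```scala
object LazyIteratorDemo {
  // Hypothetical stand-in for a real database connection.
  final class FakeConnection {
    private var open = true
    def close(): Unit = { open = false }
    def lookup(record: Int): String = {
      if (!open) throw new IllegalStateException("connection already closed")
      s"row-$record"
    }
  }

  // Same shape as a function passed to mapPartitions: Iterator in, Iterator out.
  def brokenPartitionFunction(partition: Iterator[Int]): Iterator[String] = {
    val connection = new FakeConnection
    // Iterator.map is lazy: no lookup is executed on this line.
    val result = partition.map(record => connection.lookup(record))
    connection.close() // runs before any record has been processed
    result
  }

  def main(args: Array[String]): Unit = {
    val out = brokenPartitionFunction(Iterator(1, 2, 3))
    // The lookups only run when the iterator is consumed, after close().
    val outcome = scala.util.Try(out.toList)
    println(outcome.isFailure)
  }
}
```

Consuming the returned iterator is what triggers the lookups, and by then the connection is already closed.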
To use some form of resource which is initialized in mapPartitions, you'll have to design your code in a way that doesn't require explicit resource release.
Why, in the first snippet of text, would the database connection be created every time for each element of an RDD using map? I can't seem to find the right reason.
Without the snippet in question the answer must be generic: map and foreach are not designed to handle external state. With the API shown in your question you'd have to write:
rdd.map(record => readMatchingFromDB(record, new DbConnection))
which obviously creates a connection for each element.
It is not impossible to use, for example, a singleton connection pool, doing something similar to:
object Pool {
  // initialized lazily, once per JVM (i.e. once per executor)
  lazy val pool = ???
}
rdd.map(record => readMatchingFromDB(record, Pool.pool.getConnection))
but it is not always easy to do it right (think about thread safety). And because connections and similar objects cannot, in general, be serialized, we cannot just use closures.
In contrast, the foreachPartition pattern is both explicit and simple.
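As a rough sketch of that pattern: the function passed to foreachPartition returns Unit, so every record is consumed before the connection is closed. The example below uses plain Scala collections as a stand-in for an RDD's partitions so that it runs without Spark; FakeConnection, writePartition and the counters are hypothetical names, not part of any Spark API.

```scala
import scala.collection.mutable.ArrayBuffer

object ForeachPartitionSketch {
  // Bookkeeping for the demo only.
  var connectionsOpened = 0
  val written: ArrayBuffer[Int] = ArrayBuffer.empty[Int]

  // Hypothetical stand-in for a real database connection.
  final class FakeConnection {
    connectionsOpened += 1
    def write(record: Int): Unit = { written += record }
    def close(): Unit = ()
  }

  // Shape of a function passed to foreachPartition: Iterator[T] => Unit.
  def writePartition(partition: Iterator[Int]): Unit = {
    val connection = new FakeConnection // one connection per partition
    try partition.foreach(record => connection.write(record)) // eager: consumes all records
    finally connection.close() // safe: nothing is left to process lazily
  }

  def main(args: Array[String]): Unit = {
    // Two in-memory "partitions" stand in for the partitions of an RDD.
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5))
    partitions.foreach(p => writePartition(p.iterator))
    println(s"connections opened: $connectionsOpened, records written: ${written.size}")
  }
}
```

With an actual RDD[Int], the same function would be passed as rdd.foreachPartition(writePartition), assuming the connection is created inside the function so that it never needs to be serialized.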
It is of course possible to force eager execution to make things work, for example:
val newRd = myRdd.mapPartitions(
  partition => {
    val connection = new DbConnection /* creates a db connection per partition */
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      }).toList // consumes the iterator, so the connection can be closed afterwards
    connection.close()
    newPartition.iterator // a new iterator over the materialized results
  })
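but it is of course risky and can actually decrease performance, because toList materializes the whole partition in memory before anything downstream can consume it.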
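The cost of forcing eager execution can be seen in a small plain-Scala sketch (no Spark involved). processedAfterTakingOne is a hypothetical helper that counts how many records are processed when a downstream consumer only asks for one element.

```scala
object EagerVsLazyDemo {
  // Returns how many records were processed after the consumer took one element.
  def processedAfterTakingOne(eager: Boolean): Int = {
    var processed = 0
    val partition = Iterator.range(0, 1000)
    // Stand-in for an expensive per-record lookup.
    val mapped = partition.map { record => processed += 1; record * 2 }
    // Eager variant: toList runs the lookup for every record up front.
    val result = if (eager) mapped.toList.iterator else mapped
    result.next() // downstream consumes a single element
    processed
  }

  def main(args: Array[String]): Unit = {
    println(processedAfterTakingOne(eager = false)) // lazy: only what was requested
    println(processedAfterTakingOne(eager = true))  // eager: the whole partition
  }
}
```

In the lazy case only the requested record is processed, while the eager case processes and holds all 1000 results before returning the first one.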
The same thing does not happen with sc.textFile ... and reading into dataframes from jdbc connections. Or does it?
Both operate using a much lower-level API, but of course resources are not initialized for each record.