Question
My understanding of how Spark distributes code to the nodes that run it is only cursory, and I cannot get my code to run successfully inside Spark's mapPartitions API when I want to instantiate a class once per partition, with an argument.
The code below worked perfectly, up until I evolved the class MyWorkerClass to require an argument:
val result: Dataset[Post] =
  inputDF.as[Foo].mapPartitions(sparkIterator => {
    // (1) initialize the heavy class instance once per partition
    val workerClassInstance = MyWorkerClass(bar)
    // (2) provide an iterator using a function from that class instance
    new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
  })
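For context, MyWorkerClass is roughly of the shape sketched below. This is a minimal sketch only: the member names, the Post type, and the assumption that bar is a String are placeholders; the relevant part is that construction loads heavy assets once per partition, and recordProcessFunc is the per-record function handed to the iterator wrapper.

// Sketch only; the real member names and types differ
case class Post(text: String)

case class MyWorkerClass(bar: String) {
  // stand-in for the expensive initialization that should happen once per partition
  private val heavyAsset: String = s"asset loaded using $bar"

  // the per-record function passed to CloseableIteratorForSparkMapPartitions
  def recordProcessFunc(record: Post): Post =
    record.copy(text = s"${record.text} [$heavyAsset]")
}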
That code worked perfectly well up to the point when I had (or chose) to add a constructor argument to MyWorkerClass. The passed argument arrives as null in the worker, instead of the real value of bar; somehow the serialization of the argument does not work as intended.
How would you go about this?
Additional Thoughts/Comments
I'll avoid adding the bulky code of CloseableIteratorForSparkMapPartitions; it merely provides a Spark-friendly iterator, and may not even be the most elegant implementation of one.
As I understand it, the constructor argument is not being passed correctly to the Spark worker because of how Spark captures state when serializing the closure it sends for execution on the worker. However, instantiating the class does seamlessly make the heavy-to-load assets inside it available to the function provided on the last line of the code above, and the class does seem to be instantiated once per partition, which is a valid, if not key, use case for mapPartitions over map.
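If that understanding is right, the failure mode would be something like the toy illustration below. The class and member names are hypothetical, and the @transient field is only one way a captured value can come back as null, used here just to make the round trip reproducible outside Spark.

import java.io._

// Toy illustration only (hypothetical names). A lambda that reads a field of its
// enclosing class captures the whole enclosing instance; if that field does not
// survive serialization (simulated here with @transient), it arrives as null.
class Driver(arg: String) extends Serializable {
  @transient private val bar: String = arg   // not written out when `this` is serialized

  // the "closure" handed to Spark: it references this.bar, so it drags `this` along
  def makeClosure: String => String = record => s"$record processed with bar=$bar"
}

object ClosureCaptureDemo extends App {
  val closure = new Driver("real value").makeClosure
  println(closure("record"))                 // bar has its real value on the "driver"

  // round-trip through Java serialization, roughly what shipping the closure
  // to an executor amounts to
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(closure)
  out.close()
  val shipped = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    .readObject().asInstanceOf[String => String]
  println(shipped("record"))                 // bar is null on the "executor" side
}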
What I cannot figure out how to enable, or work around, is passing an argument to that instantiation. In my case the argument is a value known only after the program has started running (it is in fact a program argument, though invariant throughout a single execution of the job), and I need it passed along for the initialization of the class.
I tried to work around this by providing a function which instantiates MyWorkerClass from its input argument, rather than instantiating it directly as above, but this did not solve matters.
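Concretely, that attempt looked roughly like the following (reconstructed, not the exact code, and again assuming a String argument for the sake of the sketch):

// indirection through a function value instead of direct construction
val makeWorker: String => MyWorkerClass = b => MyWorkerClass(b)

val result = inputDF.as[Foo].mapPartitions(sparkIterator => {
  // bar still arrives as null here, exactly as with the direct instantiation
  val workerClassInstance = makeWorker(bar)
  new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
})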
The root symptom of the problem is not an exception, but simply that the value of bar when MyWorkerClass is instantiated is null, instead of the actual value of bar, which is known in the scope of the code surrounding the snippet included above!
* one related old Spark issue discussion here
Source: https://stackoverflow.com/questions/61527641/spark-serializes-variable-value-as-null-instead-of-its-real-value