In cluster mode, how do I write a closure function f1 so that every worker can access its own copy of the variable N?
    N = 5
    lines = sc.parallelize(['a', 'b', 'c'])  # example input
    def f1(line):
        return line * N  # N is defined outside f1 but used inside it
In other words, can a worker access the variable N, which is defined on the driver outside f1 but used inside f1?
Kind of.
However, when this code is computed, Spark will analyze the definition of f1, determine which variables are present in its closure, and serialize them along with f1.
So when the function is actually invoked, a local copy of the parent environment will be present in its scope.
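Here is a minimal sketch of what that looks like end to end (the sample data, the local[2] master, and the app name are illustrative, not anything special):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "closure-demo")
    N = 5

    def f1(line):
        # N is not passed as an argument; it is captured in f1's closure,
        # serialized together with the function, and deserialized on each worker.
        return len(line) + N

    lines = sc.parallelize(["foo", "quux"])
    print(lines.map(f1).collect())  # [8, 9] -- every task saw its own copy of N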
Keeping these two things in mind, we can answer the question:
I don't have a cluster, and I really want to know: will it work in cluster mode?
Yes, it will work just fine on a distributed cluster.
However, if you try to modify an object passed through the closure, the changes won't be propagated; they will affect only the local copies (in other words, don't even try).
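You can see this for yourself with a sketch like the following (counter and bump are illustrative names; in PySpark, tasks run in separate worker processes, so the driver's object is never touched):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "closure-mutation-demo")
    counter = {"value": 0}  # mutable object captured in the closure

    def bump(x):
        # Each task mutates its own deserialized copy of `counter`,
        # never the driver's original object.
        counter["value"] += x
        return x

    sc.parallelize([1, 2, 3, 4]).map(bump).collect()
    print(counter["value"])  # still 0 on the driver

If you need state to come back to the driver, return it from the transformation (e.g. via map and collect or a reduce) rather than mutating something captured in the closure.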