I get this weird error message:
15/01/26 13:05:12 INFO spark.SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last): ...
If it's really a pickling issue for a MethodDescriptorType, you could register how to pickle that type with something like this:
from pickle import Pickler

# the "type" that is failing to pickle
MethodDescriptorType = type(type.__dict__['mro'])

def _getattr(objclass, name, repr_str):
    # hack to grab the reference directly
    try:
        attr = repr_str.split("'")[3]
        return eval(attr + '.__dict__["' + name + '"]')
    except Exception:
        attr = getattr(objclass, name)
        if name == '__dict__':
            attr = attr[name]
        return attr

def save_wrapper_descriptor(pickler, obj):
    # reduce the descriptor to a lookup that is re-evaluated on unpickling
    pickler.save_reduce(_getattr, (obj.__objclass__, obj.__name__,
                                   obj.__repr__()), obj=obj)
    return

# register the above for MethodDescriptorType with:
# Pickler.dispatch[MethodDescriptorType] = save_wrapper_descriptor
Then, if you register the above in the pickling dispatch table that Spark uses (as shown above, or with copy_reg), it may get past the pickling error.
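For reference, a registration through the copy_reg / copyreg route might look like the sketch below. The reducer simply looks the descriptor up again by name on its defining class, which is a simpler (and less general) reduction than the repr-based hack above; whether Spark's serializer actually consults the copyreg table depends on which pickler it is configured with, so treat this as an assumption to verify:

try:
    import copy_reg as copyreg  # Python 2
except ImportError:
    import copyreg              # Python 3

MethodDescriptorType = type(type.__dict__['mro'])

def _reduce_method_descriptor(obj):
    # recreate the descriptor on unpickling by fetching it from its class
    return getattr, (obj.__objclass__, obj.__name__)

copyreg.pickle(MethodDescriptorType, _reduce_method_descriptor)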
Spark tries to serialize the connection object so it can be used inside the executors, and that is bound to fail: a deserialized database connection object cannot carry its read/write permissions into another scope (or onto another machine). The problem can be reproduced by simply trying to broadcast the connection object; in this instance the failure came from serializing an I/O object.
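A minimal reproduction might look like the following, using sqlite3 purely as a stand-in for whatever client library is actually involved (the connection wraps file/socket handles, so pickling fails the moment sc.broadcast tries to serialize it):

import sqlite3
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-connection-repro")
conn = sqlite3.connect("example.db")   # any live connection object will do

# raises a pickling error: the connection holds unpicklable handles
broadcast_conn = sc.broadcast(conn)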
The problem was partly solved by connecting to the database inside the map function. Since that opens one connection per RDD element, I had to switch to per-partition processing, which cut the number of database connections from about 20k down to roughly 8-64 (depending on the number of partitions). Spark developers should consider providing an initialization function/script for the executors to avoid this kind of dead end.
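A sketch of that per-partition workaround (rdd, the results table and sqlite3 are stand-ins for the real data and client):

import sqlite3

def write_partition(rows):
    # one connection per partition instead of one per element
    conn = sqlite3.connect("example.db")
    try:
        for row in rows:
            conn.execute("INSERT INTO results VALUES (?)", (row,))
        conn.commit()
    finally:
        conn.close()
    return iter([])  # mapPartitions expects an iterator back

rdd.mapPartitions(write_partition).count()  # count() just forces evaluation

When the work is purely a side effect, foreachPartition is arguably the better fit, since it skips building a result RDD.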
Suppose such an init function were executed on every node: each node would then hold its own connection (to a connection pool, or to separate ZooKeeper nodes), the init function and the map functions would share the same scope, the problem would be gone, and the code would be faster than the workaround I found. At the end of the execution Spark would free/unload those variables and the program would end.
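Something close to that behaviour can be approximated today with a lazily created module-level connection, assuming the helper lives in a module that is shipped to the executors (e.g. via --py-files) so the module global survives across tasks in the same worker process; sqlite3 and the module name are again only placeholders:

# db_helper.py -- hypothetical module shipped to the executors
import sqlite3

_conn = None

def get_connection():
    # created once per Python worker process, then reused by every task
    global _conn
    if _conn is None:
        _conn = sqlite3.connect("example.db")
    return _conn

# driver side
from db_helper import get_connection
rdd.map(lambda row: get_connection().execute("SELECT 1").fetchone()).collect()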