Question
I have written a class implementing a classifier in Python. I would like to use Apache Spark to parallelize the classification of a huge number of datapoints using this classifier.
- I'm set up on Amazon EC2 with a cluster of 10 slaves, based on an AMI that comes with Python's Anaconda distribution. The AMI also lets me use IPython Notebook remotely.
- I've defined the class BoTree in a file called BoTree.py on the master in the folder /root/anaconda/lib/python2.7/, which is where all my Python modules are.
- I've checked that I can import and use BoTree.py when running command-line Spark from the master (I just have to start by writing import BoTree and my class BoTree becomes available).
- I've used Spark's /root/spark-ec2/copy-dir.sh script to copy the /python2.7/ directory across my cluster.
- I've ssh-ed into one of the slaves and tried running ipython there, and was able to import BoTree, so I think the module has been sent across the cluster successfully (I can also see the BoTree.py file in the .../python2.7/ folder).
- On the master I've checked that I can pickle and unpickle a BoTree instance using cPickle, which I understand is PySpark's serializer; a simplified sketch of that check follows this list.
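A simplified sketch of that serialization check (not the exact code I ran; data stands in for my training set):
import cPickle
import BoTree
bo_tree = BoTree.train(data)
dumped = cPickle.dumps(bo_tree, cPickle.HIGHEST_PROTOCOL)
restored = cPickle.loads(dumped)  # round-trips fine on the master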
However, when I do the following:
import BoTree
bo_tree = BoTree.train(data)
rdd = sc.parallelize(keyed_training_points)  # create an RDD of 10 (integer, (float, float)) tuples
rdd = rdd.mapValues(lambda point, bt = bo_tree: bt.classify(point[0], point[1]))
out = rdd.collect()
Spark fails with the error (just the relevant bit I think):
File "/root/spark/python/pyspark/worker.py", line 90, in main
command = pickleSer.loads(command.value)
File "/root/spark/python/pyspark/serializers.py", line 405, in loads
return cPickle.loads(obj)
ImportError: No module named BoroughTree
Can anyone help me? Somewhat desperate...
Thanks
Answer 1:
Probably the simplest solution is to use the pyFiles argument when you create the SparkContext:
from pyspark import SparkContext
sc = SparkContext(master, app_name, pyFiles=['/path/to/BoTree.py'])
Every file placed there will be shipped to the workers and added to PYTHONPATH.
If you're working in interactive mode you have to stop the existing context using sc.stop() before you create a new one.
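Applied to this question, that might look roughly like the sketch below; the master URL and file path are placeholders for your own setup:
from pyspark import SparkContext

sc.stop()  # only needed if a context is already running, e.g. inside IPython Notebook
sc = SparkContext('spark://<master>:7077', 'botree_app',
                  pyFiles=['/root/anaconda/lib/python2.7/BoTree.py'])
# BoTree.py is now shipped to every worker and added to its PYTHONPATH,
# so the mapValues call from the question should run unchanged.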
Also make sure that the Spark workers are actually using the Anaconda distribution and not the default Python interpreter; based on your description that is most likely the problem. To set PYSPARK_PYTHON you can use the conf/spark-env.sh file.
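One way to check which interpreter the workers actually run is a quick sanity job like the sketch below (my own suggestion, not part of the original answer):
def interpreter_path(_):
    import sys
    return sys.executable  # path of the Python executable running on the worker

print(sc.parallelize(range(10)).map(interpreter_path).distinct().collect())
# If this prints something other than the Anaconda python (e.g. /usr/bin/python),
# set PYSPARK_PYTHON in conf/spark-env.sh and restart the workers.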
On a side note, copying files to lib is a rather messy solution. If you want to avoid pushing files using pyFiles, I would recommend creating either a plain Python package or a Conda package and doing a proper installation. This way you can easily keep track of what is installed, remove unnecessary packages, and avoid some hard-to-debug problems.
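For example, a minimal setup.py for a package wrapping BoTree.py might look like this (the layout and names are just an illustration):
# Assumed layout:
#   botree/__init__.py   (e.g. re-exporting the class from BoTree.py)
#   botree/BoTree.py
#   setup.py
from setuptools import setup, find_packages

setup(
    name='botree',
    version='0.1.0',
    packages=find_packages(),
)
# Install on every node with `pip install .` (or build a Conda package from it).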
Answer 2:
Once the SparkContext has been acquired, one may also use addPyFile to subsequently ship a module to each worker.
sc.addPyFile('/path/to/BoTree.py')
See the pyspark.SparkContext.addPyFile(path) documentation.
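Applied to the question, a rough sketch (the path is the one described in the question and is my assumption):
sc.addPyFile('/root/anaconda/lib/python2.7/BoTree.py')

import BoTree
bo_tree = BoTree.train(data)
rdd = sc.parallelize(keyed_training_points)
# With the module shipped, the workers can unpickle bo_tree, so this should now succeed:
out = rdd.mapValues(lambda point, bt=bo_tree: bt.classify(point[0], point[1])).collect()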
Source: https://stackoverflow.com/questions/31093179/how-to-use-custom-classes-with-apache-spark-pyspark