Question:
I have a Python project whose folder has the structure

main_directory
  - lib
    - lib.py
  - run
    - script.py

script.py is
from pyspark.sql import SparkSession
from lib.lib import add_two

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('script') \
    .getOrCreate()

print(add_two(1, 2))
and lib.py is

def add_two(x, y):
    return x + y
I want to launch it as a Dataproc job on GCP. I have checked online, but I have not understood well how to do it. I am trying to launch the script with
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
run/script.py
But I receive the following error message:
from lib.lib import add_two
ModuleNotFoundError: No module named 'lib.lib'
Could you help me with how to launch the job on Dataproc? The only way I have found to do it is to remove the package prefix from the import, making this change to script.py:

from lib import add_two

and then launch the job as
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
--files /lib/lib.py \
/run/script.py
However, I would like to avoid the tedious process of listing the files manually every time.
Following @Igor's suggestion to pack the project in a zip file, I have found that

zip -j --update -r libpack.zip /projectfolder/* && spark-submit --py-files libpack.zip /projectfolder/run/script.py

works. However, because -j junks the directory paths, this puts all files in the root of libpack.zip, so if there were files with the same name in different subfolders this would not work.
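Would a variant without -j be the right direction? For example (a sketch, assuming the /projectfolder layout above and that lib/ is importable as a package, e.g. it contains an __init__.py), this keeps the subdirectories intact:

cd /projectfolder
zip -r libpack.zip lib/
spark-submit --py-files libpack.zip run/script.py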
Any suggestions?
Answer 1:
If you want to preserve the project structure when submitting a Dataproc job, then you should package your project into a .zip
file and specify it in the --py-files parameter when submitting the job:
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
--py-files lib.zip \
run/script.py
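For context (this is general Spark/PySpark behavior rather than something specific to this answer): archives passed with --py-files are shipped to the cluster and added to the Python search path on both the driver and the executors, which is what lets from lib.lib import add_two resolve. As an illustrative check, you could print the zip entries on the path from inside script.py:

import sys
# zip archives distributed via --py-files show up on sys.path
print([p for p in sys.path if p.endswith('.zip')])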
To create the zip archive, you need to run:
cd main_directory/
zip -x run/script.py -r libs.zip .
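As a sanity check (using the file names from the commands above, and assuming lib/ is a regular package), you can list the archive to confirm that lib/lib.py keeps its lib/ prefix, which is what makes from lib.lib import add_two resolvable:

unzip -l libs.zip
# the listing should contain an entry like lib/lib.py, while run/script.py is excluded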
Refer to this blog post for more details on how to package dependencies in a zip archive for PySpark jobs.
Source: https://stackoverflow.com/questions/61386462/submit-a-python-project-to-dataproc-job