Question:
I have a Python project whose folder has the structure

main_directory
  - lib
    - lib.py
  - run
    - script.py

script.py is
from pyspark.sql import SparkSession
from lib.lib import add_two

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('script') \
    .getOrCreate()

print(add_two(1, 2))
and lib.py is

def add_two(x, y):
    return x + y
I want to launch it as a Dataproc job on GCP. I have checked online, but I have not understood well how to do it. I am trying to launch the script with
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
run/script.py
But I receive the following error message:
from lib.lib import add_two
ModuleNotFoundError: No module named 'lib.lib'
Could you help me with how to launch the job on Dataproc? The only way I have found to do it is to remove the package prefix from the import, making this change to script.py:

from lib import add_two

and then launch the job as
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
--files /lib/lib.py \
/run/script.py
However, I would like to avoid the tedious process of listing the files manually every time.
Following @Igor's suggestion to pack the project in a zip file, I have found that

zip -j --update -r libpack.zip /projectfolder/* && spark-submit --py-files libpack.zip /projectfolder/run/script.py

works. However, because -j junks the directory paths, this puts all files in the root of libpack.zip, so if there were files with the same name in different subfolders this would not work.
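Would a variant without -j be the right direction? For example (a sketch, assuming the /projectfolder layout above and that lib/ is importable as a package, e.g. it contains an __init__.py), this keeps the subdirectories intact:

cd /projectfolder
zip -r libpack.zip lib/
spark-submit --py-files libpack.zip run/script.py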
Any suggestions?
Answer 1:
If you want to preserve the project structure when submitting a Dataproc job, then you should package your project into a .zip
file and specify it in the --py-files parameter when submitting the job:
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
--py-files lib.zip \
run/script.py
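For context (this is general Spark/PySpark behavior rather than something specific to this answer): archives passed with --py-files are shipped to the cluster and added to the Python search path on both the driver and the executors, which is what lets from lib.lib import add_two resolve. As an illustrative check, you could print the zip entries on the path from inside script.py:

import sys
# zip archives distributed via --py-files show up on sys.path
print([p for p in sys.path if p.endswith('.zip')])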
To create the zip archive, you need to run:
cd main_directory/
zip -x run/script.py -r libs.zip .
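As a sanity check (using the file names from the commands above, and assuming lib/ is a regular package), you can list the archive to confirm that lib/lib.py keeps its lib/ prefix, which is what makes from lib.lib import add_two resolvable:

unzip -l libs.zip
# the listing should contain an entry like lib/lib.py, while run/script.py is excluded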
Refer to this blog post for more details on how to package dependencies in a zip archive for PySpark jobs.
Source: https://stackoverflow.com/questions/61386462/submit-a-python-project-to-dataproc-job