For example, I have a folder:
/
- test.py
- test.yml
and the job is submitted to the Spark cluster with:
gcloud beta dataproc jobs
Now that Dataproc is no longer in beta, to directly access a file in Cloud Storage from the PySpark code, submitting the job with the --files parameter will do the work; SparkFiles is not required. For example:
gcloud dataproc jobs submit pyspark \
--cluster *cluster name* --region *region name* \
--files gs:/// gs:///filename.py
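
Once the job is submitted this way, the test.yml from the question is staged alongside the job, so it can be opened by its base name rather than a gs:// path. A minimal sketch, assuming the file ends up in the job's working directory as described above and that PyYAML is available on the cluster:

import yaml

# test.yml was passed with --files, so it is staged with the job;
# open it by its base name instead of a gs:// URI (assumption: it
# lands in the working directory).
with open("test.yml") as f:
    config = yaml.safe_load(f)

print(config)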
When reading input from GCS via the Spark API, it works with the GCS connector.
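
For completeness, a minimal sketch of reading input directly from Cloud Storage through the Spark API (the bucket and path below are placeholders):

from pyspark.sql import SparkSession

# The GCS connector preinstalled on Dataproc clusters resolves the gs:// scheme.
spark = SparkSession.builder.appName("gcs-read-example").getOrCreate()

df = spark.read.text("gs://your-bucket/path/to/input.txt")  # placeholder path
df.show(5)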