Question
I am using Google Dataproc to submit Spark jobs and Google Cloud Composer to schedule them. Unfortunately, I am facing difficulties.
I am relying on .conf files (Typesafe config files) to pass arguments to my Spark jobs.
I am using the following Python code for the Airflow Dataproc task:
t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    dataproc_spark_properties={'spark.driver.extraJavaOptions': 'gs://file-dev/fileConf/development.conf'},
    scopes='https://www.googleapis.com/auth/cloud-platform',
    dag=dag)
But this is not working and I am getting some errors.
Could anyone help me with this?
Basically I want to be able to override the .conf files and pass them as arguments to my DataProcSparkOperator.
I also tried passing arguments='gs://file-dev/fileConf/development.conf', but this didn't take into account the .conf file mentioned in the arguments.
Answer 1:
tl;dr: You need to turn your development.conf file into a dictionary to pass to dataproc_spark_properties.
Full explanation:
There are two main ways to set properties -- on the cluster level and on the job level.
1) Job level
Looks like you are trying to set them on the job level: DataProcSparkOperator(dataproc_spark_properties={'foo': 'bar', 'foo2': 'bar2'}). That's the same as gcloud dataproc jobs submit spark --properties foo=bar,foo2=bar2 or spark-submit --conf foo=bar --conf foo2=bar2. Here is the documentation for per-job properties.
The argument to spark.driver.extraJavaOptions should be command-line arguments you would pass to java, for example -verbose:gc.
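For illustration, here is a minimal sketch of a job-level dictionary, assuming the same DAG and imports as in the question; the property keys and values are placeholders, not values taken from your development.conf:
from airflow.contrib.operators import dataproc_operator

t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    # Each entry becomes a Spark property, as with spark-submit --conf key=value.
    dataproc_spark_properties={
        'spark.executor.memory': '4g',                   # placeholder property
        'spark.driver.extraJavaOptions': '-verbose:gc',  # JVM flags, per the note above
    },
    dag=dag)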
2) Cluster level
You can also set properties on a cluster level using DataprocClusterCreateOperator(properties={'spark:foo': 'bar', 'spark:foo2': 'bar2'}), which is the same as gcloud dataproc clusters create --properties spark:foo=bar,spark:foo2=bar2 (documentation). Again, you need to use a dictionary.
Importantly, if you specify properties at the cluster level, you need to prefix them with the name of the config file you want to add the property to. If you use spark:foo=bar, that means add foo=bar to /etc/spark/conf/spark-defaults.conf. There are similar prefixes for yarn-site.xml, etc.
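As a hedged sketch (parameter values such as num_workers and zone are placeholders, and the 'spark:'-prefixed keys are just examples), cluster-level properties could look like this:
from airflow.contrib.operators import dataproc_operator

create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,        # placeholder
    zone='us-central1-a', # placeholder
    # The 'spark:' prefix routes each key into /etc/spark/conf/spark-defaults.conf.
    properties={
        'spark:spark.executor.memory': '4g',  # placeholder property
        'spark:spark.driver.memory': '2g',    # placeholder property
    },
    dag=dag)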
3) Using your .conf file at the cluster level
If you don't want to turn your .conf file into a dictionary, you can also just append it to /etc/spark/conf/spark-defaults.conf using an initialization action when you create the cluster.
E.g. (this is untested):
#!/bin/bash
set -euxo pipefail
gsutil cp gs://path/to/my.conf .
cat my.conf >> /etc/spark/conf/spark-defaults.conf
Note that you want to append to rather than replace the existing config file, just so that you only override the configs you need to.
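To tie this back to Airflow, a hedged sketch of pointing cluster creation at such a script (the GCS path and other parameter values are placeholders) could look like this:
from airflow.contrib.operators import dataproc_operator

create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster_with_init_action',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,        # placeholder
    zone='us-central1-a', # placeholder
    # Runs the script above on each node while the cluster is being created.
    init_actions_uris=['gs://my-bucket/append-spark-conf.sh'],  # placeholder path
    dag=dag)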
Source: https://stackoverflow.com/questions/52336677/passing-typesafe-config-conf-files-to-dataprocsparkoperator