Question
I am using Google Dataproc to submit Spark jobs and Google Cloud Composer to schedule them. Unfortunately, I am facing difficulties.
I am relying on .conf files (Typesafe config files) to pass arguments to my Spark jobs.
I am using the following Python code for the Airflow Dataproc task:
t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    dataproc_spark_properties={'spark.driver.extraJavaOptions': 'gs://file-dev/fileConf/development.conf'},
    scopes='https://www.googleapis.com/auth/cloud-platform',
    dag=dag)
But this is not working and I am getting some errors.
Could anyone help me with this?
Basically I want to be able to override the .conf files and pass them as arguments to my DataProcSparkOperator.
I also tried passing arguments='gs://file-dev/fileConf/development.conf', but this didn't take into account the .conf file mentioned in the arguments.
Answer 1:
tl;dr: You need to turn your development.conf file into a dictionary to pass to dataproc_spark_properties.
Full explanation:
There are two main ways to set properties -- on the cluster level and on the job level.
1) Job level
Looks like you are trying to set them on the job level: DataProcSparkOperator(dataproc_spark_properties={'foo': 'bar', 'foo2': 'bar2'}). That's the same as gcloud dataproc jobs submit spark --properties foo=bar,foo2=bar2 or spark-submit --conf foo=bar --conf foo2=bar2. Here is the documentation for per-job properties.
The argument to spark.driver.extraJavaOptions should be command-line arguments you would pass to java, for example -verbose:gc.
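For illustration, here is a minimal sketch of a job-level dictionary, assuming the same DAG and imports as in the question; the property keys and values are placeholders, not values taken from your development.conf:
from airflow.contrib.operators import dataproc_operator

t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    # Each entry becomes a Spark property, as with spark-submit --conf key=value.
    dataproc_spark_properties={
        'spark.executor.memory': '4g',                   # placeholder property
        'spark.driver.extraJavaOptions': '-verbose:gc',  # JVM flags, per the note above
    },
    dag=dag)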
2) Cluster level
You can also set properties on a cluster level using DataprocClusterCreateOperator(properties={'spark:foo': 'bar', 'spark:foo2': 'bar2'}), which is the same as gcloud dataproc clusters create --properties spark:foo=bar,spark:foo2=bar2 (documentation). Again, you need to use a dictionary.
Importantly, if you specify properties at the cluster level, you need to prefix them with the name of the config file you want to add the property to. If you use spark:foo=bar, that means add foo=bar to /etc/spark/conf/spark-defaults.conf. There are similar prefixes for yarn-site.xml, etc.
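As a hedged sketch (parameter values such as num_workers and zone are placeholders, and the 'spark:'-prefixed keys are just examples), cluster-level properties could look like this:
from airflow.contrib.operators import dataproc_operator

create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,        # placeholder
    zone='us-central1-a', # placeholder
    # The 'spark:' prefix routes each key into /etc/spark/conf/spark-defaults.conf.
    properties={
        'spark:spark.executor.memory': '4g',  # placeholder property
        'spark:spark.driver.memory': '2g',    # placeholder property
    },
    dag=dag)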
3) Using your .conf file at the cluster level
If you don't want to turn your .conf file into a dictionary, you can also just append it to /etc/spark/conf/spark-defaults.conf using an initialization action when you create the cluster.
E.g. (this is untested):
#!/bin/bash
set -euxo pipefail
gsutil cp gs://path/to/my.conf .
cat my.conf >> /etc/spark/conf/spark-defaults.conf
Note that you want to append to rather than replace the existing config file, just so that you only override the configs you need to.
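To tie this back to Airflow, a hedged sketch of pointing cluster creation at such a script (the GCS path and other parameter values are placeholders) could look like this:
from airflow.contrib.operators import dataproc_operator

create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster_with_init_action',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,        # placeholder
    zone='us-central1-a', # placeholder
    # Runs the script above on each node while the cluster is being created.
    init_actions_uris=['gs://my-bucket/append-spark-conf.sh'],  # placeholder path
    dag=dag)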
Source: https://stackoverflow.com/questions/52336677/passing-typesafe-config-conf-files-to-dataprocsparkoperator