Use existing SparkSession in POST/batches request

Submitted by a 夏天 on 2019-12-20 04:24:05

Question


I'm trying to use Livy to remotely submit several Spark jobs. Let's say I want to perform the following spark-submit task remotely (with all the options as shown)

spark-submit \
--class com.company.drivers.JumboBatchPipelineDriver \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.serializer='org.apache.spark.serializer.KryoSerializer' \
--conf "spark.executor.extraJavaOptions= -XX:+UseG1GC" \
--master yarn \
--deploy-mode cluster \
/home/hadoop/y2k-shubham/jars/jumbo-batch.jar \
\
--start=2012-12-21 \
--end=2012-12-21 \
--pipeline=db-importer \
--run-spiders

NOTE: The options after the JAR (--start, --end, etc.) are specific to my Spark application; I'm using scopt to parse them


  • I'm aware that I can supply all the various options of the above spark-submit command using Livy's POST/batches request (a sketch of the equivalent payload follows this list).

  • But since I have to make over 250 spark-submits remotely, I'd like to exploit Livy's session-management capabilities; i.e., I want Livy to create a SparkSession once and then use it for all my spark-submit requests.

  • The POST/sessions request allows me to specify quite a few options for instantiating a SparkSession remotely. However, I see no session argument in the POST/batches request.
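For reference, here's a minimal sketch of the POST/batches payload equivalent to the spark-submit command above. The Livy host/port are placeholders, and --master / --deploy-mode are assumed to be configured server-side in livy.conf (livy.spark.master, livy.spark.deploy-mode):

curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
        "file": "/home/hadoop/y2k-shubham/jars/jumbo-batch.jar",
        "className": "com.company.drivers.JumboBatchPipelineDriver",
        "driverCores": 1,
        "driverMemory": "1g",
        "conf": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.executor.extraJavaOptions": "-XX:+UseG1GC"
        },
        "args": ["--start=2012-12-21", "--end=2012-12-21",
                 "--pipeline=db-importer", "--run-spiders"]
      }' \
  'http://<livy-host>:8998/batches'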

How can I make use of the SparkSession that I created using POST/sessions request for submitting my Spark job using POST/batches request?


I've referred to the following examples, but they only demonstrate supplying (Python) code for the Spark job within Livy's POST request:

  • pi_app
  • rssanders3/airflow-spark-operator-plugin
  • livy/examples

Answer 1:


How can I make use of the SparkSession that I created using POST/sessions request for submitting my Spark job using POST/batches request?

  • At this stage, I'm all but certain that this is not possible right now
  • @Luqman Ghani's comment gives a fairly good hint that batch-mode is intended for a different use-case than session-mode / LivyClient

The reason I've identified for why this isn't possible (please correct me if I'm wrong or incomplete) is as follows:

  • The POST/batches request accepts a JAR
  • This prevents the SparkSession (or spark-shell) from being re-used (without restarting the SparkSession), because:
    • How would you remove the JAR from the previous POST/batches request?
    • How would you add the JAR from the current POST/batches request?

And here's a more complete picture:

  • Actually, POST/sessions allows you to pass a JAR (see the sketch after this list)
  • but further interactions with that session (obviously) cannot take JARs
  • they (further interactions) can only be simple scripts (like PySpark: simple Python files) that can be loaded into the session (not JARs)
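To illustrate this asymmetry, here's a minimal sketch of creating a session that carries the JAR on its classpath (the host/port are placeholders for your Livy server); once the session is up, its statements endpoint accepts only a "code" field, never JARs:

curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
        "kind": "spark",
        "jars": ["/home/hadoop/y2k-shubham/jars/jumbo-batch.jar"],
        "driverMemory": "1g"
      }' \
  'http://<livy-host>:8998/sessions'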

Possible workaround

  • Everyone whose Spark application is written in Scala / Java and must be bundled in a JAR will face this difficulty; Python (PySpark) users are lucky here
  • As a possible workaround, you can try the following (I see no reason why it wouldn't work); a sketch follows this list:
    • launch a session with your JAR via a POST/sessions request
    • then invoke the entrypoint-class from your JAR via python (submit POST /sessions/{sessionId}/statements) as many times as you want, with possibly different parameters. While this wouldn't be straightforward, it sounds very much possible
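Here's a minimal sketch of that workaround, assuming a Scala (kind = spark) session like the one above so the JAR's classes are directly callable from statements; a pyspark session could instead reach the class through the py4j gateway. The session id (0) and host are placeholders, and the session should be in the idle state (check via GET /sessions/{sessionId}) before statements are posted:

# Invoke the JAR's entrypoint class; repeat with different arguments as needed
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"code": "com.company.drivers.JumboBatchPipelineDriver.main(Array(\"--start=2012-12-21\", \"--end=2012-12-21\", \"--pipeline=db-importer\", \"--run-spiders\"))"}' \
  'http://<livy-host>:8998/sessions/0/statements'

The statement's output can then be polled via GET /sessions/{sessionId}/statements/{statementId}.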

Finally, I found some more alternatives to Livy for remote spark-submit; see this



Source: https://stackoverflow.com/questions/51746286/use-existing-sparksession-in-post-batches-request
