How to submit multiple Spark applications in parallel without spawning separate JVMs?

Posted by 三世轮回 on 2019-12-05 13:41:55

With a use case, this is much clearer now. There are two possible solutions:

If you require shared data between those jobs, use the FAIR scheduler and a (REST) frontend, as SparkJobServer, Livy, etc. do. You don't have to use SparkJobServer itself; with a fixed scope, such a frontend should be relatively easy to code yourself, and I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library that covers this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application or framework. In this setup all jobs run inside a single long-lived Spark application (one driver), so you can size your executors according to your hardware and Spark will manage the scheduling of your jobs. With dynamic resource allocation enabled on YARN, idle executors are also released back to the cluster should your framework/app be idle. For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
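As a rough illustration of the "one shared application, many concurrent jobs" idea, here is a minimal PySpark sketch. The helper names (run_query, run_concurrently, build_session), the application name, and the pool names used below are all made up for this example; only spark.scheduler.mode, spark.scheduler.pool, setLocalProperty, sql and collect come from Spark itself. A real frontend would put an HTTP or RPC layer in front of run_concurrently. Note that in PySpark, whether local properties are isolated per Python thread depends on the Spark version and its pinned-thread setting, so treat the pool assignment as illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(session, pool_name, sql_text):
    """Run one SQL query on the shared session inside a named scheduler pool."""
    # Jobs started from this thread are assigned to the given FAIR pool.
    session.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    return session.sql(sql_text).collect()


def run_concurrently(session, requests, max_workers=4):
    """Run (pool_name, sql_text) pairs in parallel against one driver JVM.

    Results are returned in the same order as `requests`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(run_query, session, pool, sql)
            for pool, sql in requests
        ]
        return [future.result() for future in futures]


def build_session():
    """Create the single shared SparkSession (needs pyspark installed)."""
    # Imported lazily so the helpers above can be used without pyspark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("shared-frontend-sketch")  # hypothetical name
        .config("spark.scheduler.mode", "FAIR")
        .getOrCreate()
    )
```

Assuming pyspark is available, something like run_concurrently(build_session(), [("interactive", "SELECT 1"), ("batch", "SELECT count(*) FROM range(1000)")]) would run both queries from the same driver. Pools that are not declared in a fair scheduler allocation file simply get Spark's default pool settings.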

If you don't need shared data, use YARN (or another resource manager) to assign resources in a fair manner to both jobs. YARN has a fair scheduling mode, and you can set the resource demands per application. If this approach suits you otherwise but you still need to share some data, you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submit commands and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help you make this less annoying and more transparent to end users. This approach also has higher latency, since resource allocation and SparkSession initialization add a more or less constant startup cost to every application.
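The "automation around spark-submit" can start out very small. The sketch below is one possible shape, not a recommended tool: build_submit_command and launch_all are hypothetical helpers, and the application paths, names and queue names in the usage note are placeholders. The command-line flags themselves (--master, --deploy-mode, --name, --queue, --executor-memory, --num-executors) are standard spark-submit options for YARN.

```python
import subprocess


def build_submit_command(app_path, name, queue,
                         executor_memory="2g", num_executors=2):
    """Build the argument list for one spark-submit call on YARN."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", name,
        "--queue", queue,  # YARN queue that the scheduler shares fairly
        "--executor-memory", executor_memory,
        "--num-executors", str(num_executors),
        app_path,
    ]


def launch_all(commands):
    """Start every command without waiting, then collect the exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]
```

For example, launch_all([build_submit_command("job_a.py", "job-a", "queue_a"), build_submit_command("job_b.py", "job-b", "queue_b")]) would start two independent applications, each with its own driver and its own resource request.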

tl;dr I'd say it's not possible.

A Spark application consists of at least one JVM, and it is at spark-submit time that you specify the requirements of that single JVM (or of the group of JVMs that act as executors).

If, however, you want different JVM configurations without launching separate JVMs, that does not seem possible (not even outside Spark, as long as a JVM is in use).
