I want to expose my Spark applications to the users with a web application.
Basically, the user can decide which action he wants to run and enter a few variables, wh
Basically, you can use the SparkLauncher class to launch Spark applications and add listeners to watch their progress.
However, you may be interested in Livy, which is a RESTful server for Spark jobs. As far as I know, Zeppelin uses Livy to submit jobs and retrieve their status.
You can also use the Spark REST interface to check the state; the information will then be more precise. Here is an example of how to submit a job via the REST API.
You've got 3 options; the answer is: check for yourself ;) It depends very much on your project and requirements. Both of the 2 main options:
should work for you; you just have to check which one is easier and better to use in your project.
You can use Spark from your application in different ways, depending on what you need and what you prefer.
SparkLauncher is a class from the spark-launcher artifact. It is used to launch already prepared Spark jobs, just as spark-submit does.
Typical usage is:
1) Build the project with your Spark job and copy the JAR file to all nodes
2) From your client application, e.g. a web application, create a SparkLauncher that points to the prepared JAR file
SparkAppHandle handle = new SparkLauncher()
.setSparkHome(SPARK_HOME)
.setJavaHome(JAVA_HOME)
.setAppResource(pathToJARFile)
.setMainClass(MainClassFromJarWithJob)
.setMaster("MasterAddress")
.startApplication();
// or: .launch().waitFor()
startApplication creates a SparkAppHandle, which allows you to add listeners and stop the application. It also lets you retrieve the application ID via getAppId.
SparkLauncher should be used together with the Spark REST API. You can query http://driverNode:4040/api/v1/applications/*ResultFromGetAppId*/jobs
and you will get information about the current status of the application.
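A minimal sketch of that status check in Python (standard library only). It assumes the driver UI is reachable on port 4040 and that each job object returned by the /jobs endpoint carries status, numTasks and numCompletedTasks fields. The helper names jobs_url, summarize_jobs and fetch_jobs are made up for illustration.

```python
import json
from urllib.request import urlopen


def jobs_url(driver_host, app_id, port=4040):
    """Build the monitoring URL for the jobs of one application."""
    return "http://{}:{}/api/v1/applications/{}/jobs".format(driver_host, port, app_id)


def summarize_jobs(jobs):
    """Reduce the job list to counts per status plus overall task progress."""
    by_status = {}
    total = done = 0
    for job in jobs:
        by_status[job["status"]] = by_status.get(job["status"], 0) + 1
        total += job.get("numTasks", 0)
        done += job.get("numCompletedTasks", 0)
    # Avoid dividing by zero when no job has been scheduled yet.
    progress = done / total if total else 0.0
    return {"by_status": by_status, "progress": progress}


def fetch_jobs(driver_host, app_id):
    """Call the REST endpoint (needs a running driver) and decode the JSON list."""
    with urlopen(jobs_url(driver_host, app_id)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A web application could call summarize_jobs(fetch_jobs(host, handle_app_id)) on a timer and show the progress value to the user. The port only answers while the driver is alive, so a failed request may simply mean the application has finished.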
It is also possible to submit Spark jobs directly via the RESTful API. Usage is very similar to SparkLauncher, but it is done in a purely RESTful way.
Example request (credit goes to this article):
curl -X POST http://spark-master-host:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
"action" : "CreateSubmissionRequest",
"appArgs" : [ "myAppArgument1" ],
"appResource" : "hdfs:///filepath/spark-job-1.0.jar",
"clientSparkVersion" : "1.5.0",
"environmentVariables" : {
"SPARK_ENV_LOADED" : "1"
},
"mainClass" : "spark.ExampleJobInPreparedJar",
"sparkProperties" : {
"spark.jars" : "hdfs:///filepath/spark-job-1.0.jar",
"spark.driver.supervise" : "false",
"spark.app.name" : "ExampleJobInPreparedJar",
"spark.eventLog.enabled": "true",
"spark.submit.deployMode" : "cluster",
"spark.master" : "spark://spark-cluster-ip:6066"
}
}'
This command will submit the job in the ExampleJobInPreparedJar class to the cluster with the given Spark Master. The response contains a submissionId field, which is useful for checking the status of the application. Simply call another service: curl http://spark-cluster-ip:6066/v1/submissions/status/submissionIdFromResponse. That's it, nothing more to code.
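If the web application is not going to shell out to curl, the same request can be assembled in code. The following is a sketch in Python using only the standard library; it mirrors the JSON body from the curl example, and the function names build_submission, status_url and submit are hypothetical helpers, not part of Spark.

```python
import json
from urllib.request import Request, urlopen


def build_submission(jar, main_class, master, app_args, spark_version):
    """Assemble the CreateSubmissionRequest body used in the curl example."""
    return {
        "action": "CreateSubmissionRequest",
        "appArgs": list(app_args),
        "appResource": jar,
        "clientSparkVersion": spark_version,
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "mainClass": main_class,
        "sparkProperties": {
            "spark.jars": jar,
            "spark.driver.supervise": "false",
            # Assumption: reuse the simple class name as the application name.
            "spark.app.name": main_class.split(".")[-1],
            "spark.eventLog.enabled": "true",
            "spark.submit.deployMode": "cluster",
            "spark.master": master,
        },
    }


def status_url(rest_base, submission_id):
    """URL of the status service for a submissionId returned by the create call."""
    return "{}/v1/submissions/status/{}".format(rest_base.rstrip("/"), submission_id)


def submit(rest_base, payload):
    """POST the payload; returns the decoded JSON reply (needs a live master)."""
    request = Request(
        rest_base.rstrip("/") + "/v1/submissions/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json;charset=UTF-8"},
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Typical use would be reply = submit("http://spark-master-host:6066", payload), followed by a GET on status_url(..., reply["submissionId"]). Keeping the payload builder separate from the network call makes it easy to validate user-entered variables before anything is sent.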
Livy REST Server and Spark Job Server are RESTful applications that allow you to submit jobs via a RESTful web service. One major difference between those two and Spark's REST interface is that Livy and SJS don't require jobs to be prepared in advance and packaged into a JAR file. You just submit code, which is then executed in Spark.
Usage is very simple. The code is taken from the Livy repository, with some cuts to improve readability.
1) Case 1: submitting a job that is located on the local machine
// creating client
LivyClient client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.build();
try {
// sending and submitting JAR file
client.uploadJar(new File(piJar)).get();
// PiJob is a class that implements Livy's Job
double pi = client.submit(new PiJob(samples)).get();
} finally {
client.stop(true);
}
2) Case 2: dynamic job creation and execution
// example in Python. Data contains code in Scala, that will be executed in Spark
data = {
'code': textwrap.dedent("""\
val NUM_SAMPLES = 100000;
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
val x = Math.random();
val y = Math.random();
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _);
println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
""")
}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
pprint.pprint(r.json())
As you can see, both pre-compiled jobs and ad-hoc queries to Spark are possible.
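The Python snippet above only posts the statement; the result has to be read back from the statement object afterwards. Here is a small sketch of that step, assuming the decoded reply has a state field and, once the state is "available", an output object with status and data["text/plain"] (or evalue on error). The helper names statement_payload and statement_result are invented for this example.

```python
import json
import textwrap


def statement_payload(code):
    """Wrap a code snippet in the JSON body that is POSTed to the statements URL."""
    return json.dumps({"code": textwrap.dedent(code)})


def statement_result(statement):
    """Return (finished, text) for a decoded statement object.

    Assumes the reply carries a 'state' field and, once available,
    an 'output' object with 'status' and 'data' -> 'text/plain'.
    """
    if statement.get("state") != "available":
        # Still waiting or running: nothing to show yet.
        return False, None
    output = statement.get("output") or {}
    if output.get("status") != "ok":
        # On failure, report the error value if one is present.
        return True, output.get("evalue")
    return True, output.get("data", {}).get("text/plain")
```

The web application would fetch the statement repeatedly, pass r.json() to statement_result, and stop polling once the first element of the tuple is True.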
Mist is another Spark-as-a-Service application. It is very simple and similar to Livy and Spark Job Server.
Usage is very similar:
1) Create job file:
import io.hydrosphere.mist.MistJob
object MyCoolMistJob extends MistJob {
def doStuff(parameters: Map[String, Any]): Map[String, Any] = {
val rdd = context.parallelize()
...
return result.asInstanceOf[Map[String, Any]]
}
}
2) Package the job file into a JAR
3) Send a request to Mist:
curl --header "Content-Type: application/json" -X POST http://mist_http_host:mist_http_port/jobs --data '{"path": "/path_to_jar/mist_examples.jar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}'
One strong point I can see in Mist is that it has out-of-the-box support for streaming jobs via MQTT.
Apache Toree was created to enable easy interactive analytics for Spark. It doesn't require any JAR to be built. It works via the IPython protocol, but Python is not the only supported language.
Currently the documentation focuses on Jupyter notebook support, but there is also a REST-style API.
I've listed a few options:
All of them are good for different use cases. I can distinguish a few categories:
SparkLauncher is very simple and is part of the Spark project. You write the job configuration in plain code, so it can be easier to build than JSON objects.
For fully RESTful submission, consider the Spark REST API, Livy, SJS and Mist. Three of them are stable projects with some production use cases. The REST API also requires jobs to be pre-packaged, whereas Livy and SJS don't. However, remember that the Spark REST API is included by default in every Spark distribution, while Livy and SJS are not. I don't know much about Mist, but after a while it should be a very good tool for integrating all types of Spark jobs.
Toree focuses on interactive jobs. It's still in incubation, but even now you can check out its possibilities.
Why use a custom, additional REST service when there is a built-in REST API? A Spark-as-a-Service tool like Livy is a single entry point to Spark. It manages the Spark context and runs on only one node, which can be located somewhere other than the cluster. Such tools also enable interactive analytics. Apache Zeppelin uses Livy to submit users' code to Spark.
Here is an example of the SparkLauncher that T.Gawęda mentioned:
SparkAppHandle handle = new SparkLauncher()
.setSparkHome(SPARK_HOME)
.setJavaHome(JAVA_HOME)
.setAppResource(SPARK_JOB_JAR_PATH)
.setMainClass(SPARK_JOB_MAIN_CLASS)
.addAppArgs("arg1", "arg2")
.setMaster("yarn-cluster")
.setConf("spark.dynamicAllocation.enabled", "true")
.startApplication();
Here you can find an example of a Java web application with a Spark job bundled together in a single project. Through SparkLauncher you can get a SparkAppHandle, which you can use to get information about the job status. If you need the progress status, you can use the Spark REST API:
http://driverHost:4040/api/v1/applications/[app-id]/jobs
The only dependency you will need for SparkLauncher:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-launcher_2.10</artifactId>
<version>2.0.1</version>
</dependency>
You can use PredictionIO, a machine learning server for developers and ML engineers. https://github.com/apache/predictionio