TL;DR You can have as many SparkSessions as needed.
You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
But can you elaborate on what you mean by a single SparkContext on a single JVM?
It means that at any given time in the lifecycle of a Spark application there can be one and only one driver, which in turn means that there's one and only one SparkContext available on that JVM.
The driver of a Spark application is where the SparkContext lives (or rather the opposite, where the SparkContext defines the driver; the distinction is pretty blurry).
You can only have one SparkContext at a time. You can start and stop it on demand as many times as you want, but I remember an issue about it that said you should not close the SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
In other words, have a single SparkContext for the entire lifetime of your Spark application.
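To make the "many sessions, one context" point concrete, here is a minimal sketch. It assumes Spark SQL is on the classpath and uses a local master; the object name, app name and master URL are made up for illustration. It checks that a session created with newSession() is a different SparkSession object that still shares the very same SparkContext.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object SharedContextDemo {
  // Returns (the two sessions are distinct objects, both use the same SparkContext).
  def run(): (Boolean, Boolean) = {
    // Assumption: a local master is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shared-context-demo")
      .getOrCreate()
    try {
      // A second session on top of the same SparkContext.
      val other = spark.newSession()
      val distinctSessions = !(other eq spark)
      // Both sessions, and SparkContext.getOrCreate(), point at the one context in this JVM.
      val sharedContext =
        (other.sparkContext eq spark.sparkContext) &&
          (SparkContext.getOrCreate() eq spark.sparkContext)
      (distinctSessions, sharedContext)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Both flags should come back true: the sessions differ, the context does not.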
There was a similar question, "What's the difference between SparkSession.sql vs Dataset.sqlContext.sql?", about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.
I was able to call sparkSession.sparkContext().stop(), and also stop the SparkSession.
So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext on a single JVM available", isn't it?
SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.
From the point of view of a Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and for Spark properties (that could have different values per SparkSession).
If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
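As a rough illustration of the namespace idea, the sketch below registers the same temporary view name in two sessions and reads each one back. It assumes Spark SQL on the classpath and a local master; the object name, app name and the view name "t" are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object TempViewPerSessionDemo {
  // Returns the row counts seen through view "t" in the first and second session.
  def run(): (Long, Long) = {
    // Assumption: a local master is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("temp-view-demo")
      .getOrCreate()
    try {
      val other = spark.newSession()
      // Same view name, different Datasets, one per session.
      spark.range(3).createOrReplaceTempView("t")
      other.range(5).createOrReplaceTempView("t")
      // Each session resolves "t" against its own temporary views.
      (spark.table("t").count(), other.table("t").count())
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Temporary views are scoped to the session that registered them, so the second registration does not overwrite the first.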
I've just worked on an example to showcase how whole-stage codegen works in Spark SQL and created the following, which simply turns the feature off.
// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
   +- LocalTableScan [_1#88, _2#89, _3#90]
// Let's break the requirement of having at most spark.sql.codegen.maxFields fields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)
scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2
// Let's see what the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100
import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
   +- LocalTableScan [_1#121, _2#122, _3#123]
I then created a new SparkSession and used a new SparkContext. No error was thrown.
Again, how does this contradict what I said about a single SparkContext being available? I'm curious.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
You can no longer use it to run Spark jobs (to process large and distributed datasets), which is pretty much exactly the reason why you use Spark in the first place, isn't it?
Try the following:
SparkContext
An exception? Right! Remember that you close the "doors" to Spark so how could you have expected to be inside?! :)
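One way to see the closed "doors" for yourself is the sketch below. It is only an illustration, assuming Spark SQL on the classpath and a local master, with a made-up object name and app name: it runs the same tiny job before and after stopping the SparkContext and records whether each attempt worked.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object StoppedContextDemo {
  // Returns (job succeeded before stop, job failed after stop).
  def run(): (Boolean, Boolean) = {
    // Assumption: a local master is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stopped-context-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // While the context is alive, a small job runs fine.
    val before = Try(sc.parallelize(1 to 3).count())

    // Close the "doors" to Spark.
    sc.stop()

    // The same call on the stopped context now throws.
    val after = Try(sc.parallelize(1 to 3).count())

    (before.isSuccess, after.isFailure)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The second attempt fails because a stopped SparkContext refuses to create RDDs or run jobs; to keep working you would need to create a new context (or session), not reuse the stopped one.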