I'm migrating some code from Spark 1.6 to Spark 2.1 and struggling with the following issue:
This worked perfectly in Spark 1.6:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
val rows = sparkContext.parallelize(Seq(Row(Some(1L))))
sqlContext.createDataFrame(rows, schema).show
The same code in Spark 2.1.1:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
// ss is the SparkSession
val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
val rows = ss.sparkContext.parallelize(Seq(Row(Some(1L))))
ss.createDataFrame(rows, schema).show
gives the following Runtime exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 72, i89203.sbb.ch, executor 9): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of bigint
So how should I translate such code to Spark 2.x if I want to have nullable Long's rather than using Option[Long]?
There is actually a JIRA ticket about this, SPARK-19056, but it was not treated as a bug.
So this behavior is intentional.
Allowing Option in Row is never documented and brings a lot of troubles when we apply the encoder framework to all typed operations. Since Spark 2.0, please use Dataset for typed operation/custom objects. e.g.
val ds = Seq(1 -> None, 2 -> Some("str")).toDS
ds.toDF // schema: <_1: int, _2: string>
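Applied to the nullable Long column from the question, the Dataset route might look like the sketch below. It is only an illustration: the `Record` case class, the `nullableLongs` method and the local SparkSession in `main` are made up for this example (in the question the session already exists as `ss`).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object OptionDatasetSketch {
  // Hypothetical record type; the field name matches the "i" column from the question.
  case class Record(i: Option[Long])

  // Builds a typed Dataset; the encoder maps Option[Long] to a nullable bigint column,
  // so Some(1L) is stored as 1 and None is stored as null.
  def nullableLongs(ss: SparkSession): Dataset[Record] = {
    import ss.implicits._
    Seq(Record(Some(1L)), Record(None)).toDS()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only to make the sketch runnable.
    val ss = SparkSession.builder()
      .master("local[1]")
      .appName("option-dataset-sketch")
      .getOrCreate()

    val ds = nullableLongs(ss)
    ds.printSchema()
    ds.show()

    ss.stop()
  }
}
```

With this approach the Option stays in the case class, and no Row or explicit schema is needed.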
The error message is clear. It says that Some was passed where a bigint is required:
scala.Some is not a valid external type for schema of bigint
So you need to unwrap the Option before it goes into the Row, for example by combining Option with getOrElse, so that an empty Option (None) becomes null. The following code should work for you:
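import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
val sc = ss.sparkContext
val sqlContext = ss.sqlContext
val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
// getOrElse(null) unwraps the Option: Some(1L) becomes 1L, None would become null
val rows = sc.parallelize(Seq(Row(Option(1L).getOrElse(null))))
sqlContext.createDataFrame(rows, schema).show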
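If the rows are built from values that may or may not be wrapped in Option, the unwrapping can be pulled into a small helper. This is only a sketch: `unwrap` and `rowOf` are names made up for this example, not Spark APIs, and the local SparkSession in `main` is a stand-in for the existing `ss`.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object OptionRowsSketch {
  // Hypothetical helper: Some(v) becomes v, None becomes null, anything else is kept as-is.
  def unwrap(value: Any): Any = value match {
    case Some(v) => v
    case None    => null
    case other   => other
  }

  // Hypothetical helper: build a Row from values that may still be wrapped in Option.
  def rowOf(values: Any*): Row = Row.fromSeq(values.map(unwrap))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only to make the sketch runnable.
    val ss = SparkSession.builder()
      .master("local[1]")
      .appName("option-rows-sketch")
      .getOrCreate()

    val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
    // One row with a value and one row with a null in column "i".
    val rows = ss.sparkContext.parallelize(Seq(rowOf(Some(1L)), rowOf(None)))
    ss.createDataFrame(rows, schema).show()

    ss.stop()
  }
}
```

The pattern match covers the None case as well, which the single-value example above does not exercise.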
I hope this answer is helpful
Source: https://stackoverflow.com/questions/44324195/problems-to-create-dataframe-from-rows-containing-optiont