Problems creating a DataFrame from Rows containing Option[T]

Submitted by 旧巷老猫 on 2020-01-11 10:25:33

Question


I'm migrating some code from Spark 1.6 to Spark 2.1 and struggling with the following issue:

This worked perfectly in Spark 1.6:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
val rows = sparkContext.parallelize(Seq(Row(Some(1L))))
sqlContext.createDataFrame(rows, schema).show

The same code in Spark 2.1.1:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// ss is the SparkSession
val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
val rows = ss.sparkContext.parallelize(Seq(Row(Some(1L))))
ss.createDataFrame(rows, schema).show

throws the following runtime exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 72, i89203.sbb.ch, executor 9): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of bigint

So how should I translate such code to Spark 2.x if I want nullable Longs in the Row rather than Option[Long]?


Answer 1:


There is a JIRA ticket, SPARK-19056, about this behavior, but it turns out not to be a bug.

So this behavior is intentional.

Allowing Option in Row was never documented and causes a lot of trouble when the encoder framework is applied to all typed operations. Since Spark 2.0, use Dataset for typed operations and custom objects, e.g.:

val ds = Seq(1 -> None, 2 -> Some("str")).toDS
ds.toDF // schema: <_1: int, _2: string>
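
The same idea can be sketched for the schema from the question, where the single column `i` is a nullable long. This is only a minimal sketch: the case class `Record`, the object name, and the local-mode session are assumptions made for illustration and do not come from the original post or the JIRA ticket.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration; the field name "i" mirrors the question's schema.
case class Record(i: Option[Long])

object OptionDatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to try this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("option-dataset-sketch")
      .getOrCreate()
    import spark.implicits._ // brings the encoders and toDS into scope

    // Option[Long] is handled by the encoder: None becomes a SQL NULL.
    val ds = Seq(Record(Some(1L)), Record(None)).toDS()
    val df = ds.toDF()

    df.printSchema() // the column i should be reported as a nullable long
    df.show()

    spark.stop()
  }
}
```

With this approach the Option never ends up inside a Row; the encoder derives the nullable column from the case class instead.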



Answer 2:


The error message is clear: a Some was supplied where a bigint is required.

scala.Some is not a valid external type for schema of bigint

So you need to unwrap the Option with getOrElse, substituting null when the Option is empty. The following code should work for you:

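import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val sc = ss.sparkContext
val sqlContext = ss.sqlContext
val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
// getOrElse(null) yields the wrapped value, or null when the Option is None
val rows = sc.parallelize(Seq(Row(Option(1L).getOrElse(null))))
sqlContext.createDataFrame(rows, schema).show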
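
If many fields hold Options, the unwrapping could be factored into a small helper. The sketch below is one possible way to do it; `NullableFields` and `unwrap` are hypothetical names and not part of Spark. The helper itself has no Spark dependency, so it can be tried in a plain Scala REPL.

```scala
// Hypothetical helper (not part of Spark): turns Option values into plain
// values or null, so that the result is acceptable as a Row field.
object NullableFields {
  def unwrap(value: Any): Any = value match {
    case Some(v) => v     // keep the wrapped value
    case None    => null  // an empty Option becomes null
    case other   => other // non-Option values pass through unchanged
  }

  def main(args: Array[String]): Unit = {
    val raw: Seq[Any] = Seq(Some(1L), None, "plain")
    val fields = raw.map(unwrap)
    println(fields) // prints List(1, null, plain)
    // With Spark on the classpath, a Row could then be built with
    // org.apache.spark.sql.Row.fromSeq(fields)
  }
}
```

The corresponding schema fields still have to be declared with nullable = true, as in the code above.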

I hope this answer is helpful



Source: https://stackoverflow.com/questions/44324195/problems-to-create-dataframe-from-rows-containing-optiont
