Question
Using Spark 1.6.0. Say I have a class like this:
case class MyClass(date: java.util.Date, oid: org.bson.types.ObjectId)
If I have
//rdd: RDD[MyClass]
rdd.toDF("date", "oid")
I get java.lang.UnsupportedOperationException: Schema for type java.util.Date/org.bson.types.ObjectId is not supported.
Now, I know I can make it a java.sql.Date, but let's say MyClass is depended upon in too many other places to make that change everywhere, and that still won't solve the ObjectId problem.
I am also aware of the UserDefinedType option, but it seems like that only works if you also create a new class to work with it (and again, the signature of MyClass needs to stay the same).
Is there not a way to just register a serializer/deserializer for java.util.Date and org.bson.types.ObjectId so that I can call toDF on the RDD[MyClass] and it will just work?
UPDATE
This doesn't exactly answer my question, but it unblocked us, so I will share it here in the hope that it's helpful for someone else. Most JSON libraries do support this use case, and spark-sql has a built-in sqlContext.read.json(stringRdd).write.parquet("/path/to/output"). So you can just define the (de)serialization for the class using your JSON library of choice, serialize to strings, and spark-sql can handle the rest.
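A minimal sketch of that detour, assuming Spark 1.6; the hand-rolled string interpolation below just stands in for a real JSON library, and the epoch-millis/hex-string encoding is one arbitrary choice:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def writeViaJson(sqlContext: SQLContext, rdd: RDD[MyClass]): Unit = {
  // Render each record as a JSON string; your JSON library's (de)ser
  // for MyClass would go here instead of string interpolation.
  val stringRdd: RDD[String] = rdd.map { c =>
    s"""{"date": ${c.date.getTime}, "oid": "${c.oid.toString}"}"""
  }
  // spark-sql infers the schema from the JSON and handles the rest.
  sqlContext.read.json(stringRdd).write.parquet("/path/to/output")
}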
Answer 1:
It depends on what you mean by "just work". To serialize/deserialize an object, all you need is a corresponding UserDefinedType and proper annotations. For example, something like this:
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
import org.bson.types.ObjectId

@SQLUserDefinedType(udt = classOf[MyClassUDT])
case class MyClass(date: java.util.Date, oid: ObjectId)

class MyClassUDT extends UserDefinedType[MyClass] {
  // Internal representation: a struct with a date column and a string column.
  override def sqlType: StructType = StructType(Seq(
    StructField("date", DateType, nullable = false),
    StructField("oid", StringType, nullable = false)
  ))

  // User object -> Catalyst internal row.
  override def serialize(obj: Any): InternalRow = obj match {
    case MyClass(date, oid) =>
      val row = new GenericMutableRow(2)
      row(0) = new java.sql.Date(date.getTime)
      row(1) = UTF8String.fromString(oid.toString)
      row
  }

  // Catalyst internal row -> user object.
  override def deserialize(datum: Any): MyClass = datum match {
    case row: InternalRow =>
      val date: java.util.Date = new java.util.Date(
        row.get(0, DateType).asInstanceOf[java.sql.Date].getTime)
      val oid = new ObjectId(row.getString(1))
      MyClass(date, oid)
  }

  override def userClass: Class[MyClass] = classOf[MyClass]
}
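A quick smoke test might look like the following (a hypothetical snippet, assuming a SparkContext sc and import sqlContext.implicits._ are in scope; wrapping MyClass in a tuple just gives the column a name):
val df = sc.parallelize(Seq(
  (1, MyClass(new java.util.Date(), new ObjectId()))
)).toDF("id", "obj")
df.printSchema() // the obj column is backed by MyClassUDT's sqlType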
It doesn't mean that you'll be able to directly access any method defined on the class or any of its fields. To be able to do that you'll need UDFs.
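For instance, a sketch of pulling a field back out through a UDF (df and the obj column are assumptions carried over from the snippet above):
import org.apache.spark.sql.functions.udf

// The UDF receives the deserialized MyClass, so its fields are reachable again.
val oidOf = udf((c: MyClass) => c.oid.toString)
val withOid = df.withColumn("oidString", oidOf(df("obj")))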
Spark Datasets come a little closer to seamless integration, but AFAIK it is not yet possible to define custom encoders.
Source: https://stackoverflow.com/questions/35215458/serialize-deserialize-existing-class-for-spark-sql-dataframe