If I wanted to create a StructType (i.e. a DataFrame.schema) out of a case class, is there a way to do it without creating a DataFrame first?
I know this question is almost a year old, but I came across it and thought others who do as well might want to know that I have just learned this approach:
import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema
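For example, a minimal sketch (MyCaseClass is a hypothetical case class used only for illustration):

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

case class MyCaseClass(id: Long, name: String)

val mySchema: StructType = Encoders.product[MyCaseClass].schema
mySchema.printTreeString()
// prints something like:
// root
//  |-- id: long (nullable = false)
//  |-- name: string (nullable = true)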
Instead of manually reproducing the logic for creating the implicit Encoder object that gets passed to toDF, one can use that encoder directly (or, more precisely, implicitly, in the same way as toDF does):
// spark: SparkSession
import org.apache.spark.sql.Encoder
import spark.implicits._

implicitly[Encoder[MyCaseClass]].schema
Unfortunately, this actually suffers from the same problem as using org.apache.spark.sql.catalyst or Encoders as in the other answers: the Encoder trait is experimental.
How does this work? The toDF method on Seq comes from a DatasetHolder, which is created via the implicit localSeqToDatasetHolder that is imported via spark.implicits._. That function is defined as:
implicit def localSeqToDatasetHolder[T](s: Seq[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
As you can see, it takes an implicit Encoder[T] argument, which, for a case class, can be computed via newProductEncoder (also imported via spark.implicits._). We can reproduce this implicit logic to get an Encoder for our case class via the convenience scala.Predef.implicitly (in scope by default, because it comes from Predef), which simply returns its requested implicit argument:
def implicitly[T](implicit e: T): T
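To make the resolution concrete, here is a rough sketch of what the compiler ends up doing, assuming the same hypothetical MyCaseClass as above (newProductEncoder can also be called explicitly):

// spark: SparkSession
import org.apache.spark.sql.Encoder

val enc: Encoder[MyCaseClass] = spark.implicits.newProductEncoder[MyCaseClass]
enc.schema  // same StructType as implicitly[Encoder[MyCaseClass]].schema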
In case someone wants to do this for a custom Java bean:
ExpressionEncoder.javaBean(Event.class).schema().json()
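If you are working from Scala instead, a rough equivalent sketch would be the following (Event here is a hypothetical bean-style class; @BeanProperty generates the getters and setters the bean encoder inspects):

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import scala.beans.BeanProperty

class Event {
  @BeanProperty var name: String = _
  @BeanProperty var timestamp: Long = _
}

val beanSchema = ExpressionEncoder.javaBean(classOf[Event]).schema
beanSchema.json  // JSON representation of the schema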
You can do it the same way SQLContext.createDataFrame does it:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema = ScalaReflection.schemaFor[TestCase].dataType.asInstanceOf[StructType]
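For instance, assuming a hypothetical case class TestCase(a: String, b: Int), the resulting schema would print roughly as:

// root
//  |-- a: string (nullable = true)
//  |-- b: integer (nullable = false)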