Spark - creating schema programmatically with different data types


Question


I have a dataset consisting of 7-8 fields of type String, Int, and Float.

I am trying to create the schema programmatically like this:

val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

And then mapping the data to Rows like this:

val dataRdd = datafile.filter(x => x!=header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4) ,col(5), col(6), col(7), col(8)))

But after creating the DataFrame, when I call DF.show() it throws an error for the Integer field.

So how do I create a schema when the dataset contains multiple data types?


Answer 1:


The problem in your code is that you declare every field as StringType, while the Rows you build contain Int and Float values, so the schema and the data disagree.

Assuming the header contains only the field names, you can't infer the types from it.

Let's assume the header string looks like this:

val header = "field1:Int,field2:Double,field3:String"

Then the code would be:

import org.apache.spark.sql.types._

// Map the type annotation after ":" in each header token to a Spark SQL DataType.
def inferType(field: String) = field.split(":")(1) match {
  case "Int"    => IntegerType
  case "Double" => DoubleType
  case "String" => StringType
  case _        => StringType
}

val schema = StructType(header.split(",").map(column => StructField(column, inferType(column), true)))

For the example header string you get:

root
 |-- field1:Int: integer (nullable = true)
 |-- field2:Double: double (nullable = true)
 |-- field3:String: string (nullable = true)
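
Note that the column names keep the ":Type" suffix, because the full header token is passed as the field name. If you want clean column names, a minimal variation of the same idea (a sketch, assuming every header token has the name:Type form) splits out the name as well:

import org.apache.spark.sql.types._

// Sketch: turn each "name:Type" header token into a StructField with a clean name.
def toField(token: String): StructField = {
  val Array(name, tpe) = token.split(":")  // assumes every token is "name:Type"
  val dataType = tpe match {
    case "Int"    => IntegerType
    case "Double" => DoubleType
    case _        => StringType
  }
  StructField(name, dataType, nullable = true)
}

val cleanSchema = StructType(header.split(",").map(toField))

With this variant the schema tree prints field1: integer (nullable = true) and so on, without the suffix.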

On the other hand, if what you need is a DataFrame from a text file, I would suggest creating the DataFrame directly from the file itself; there is no point in going through an RDD.

val fileReader = spark.read.format("com.databricks.spark.csv")
  .option("mode", "DROPMALFORMED")  // drop rows that fail to parse
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")

val df = fileReader.load(PATH_TO_FILE)
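
For reference, on Spark 2.x the CSV reader is built in, so the external com.databricks.spark.csv package is not needed. A self-contained sketch (the app name and path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-schema-example")  // hypothetical app name
  .getOrCreate()

// Built-in csv source; inferSchema makes Spark scan the data to choose column types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .csv("/path/to/data.csv")  // placeholder path

df.printSchema()  // check what was inferred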



Answer 2:


Define the StructType first:

import org.apache.spark.sql.types._

val schema1 = StructType(Array(
  StructField("AuctionId", StringType, true),
  StructField("Bid", IntegerType, false),
  StructField("BidTime", FloatType, false),
  StructField("Bidder", StringType, true),
  StructField("BidderRate", FloatType, false),
  StructField("OpenBid", FloatType, false),
  StructField("Price", FloatType, false),
  StructField("Item", StringType, true),
  StructField("DaystoLive", IntegerType, false)
))

Then specify each column that is going to be present inside a Row, converting each value to its declared type:

import org.apache.spark.sql.Row

// Skip the header line, split each record on commas, and convert each column
// to the type declared in schema1.
val dataRdd = datafile.filter(x => x != header).map(x => x.split(","))
  .map(col => Row(
    col(0).trim,
    col(1).trim.toInt,
    col(2).trim.toFloat,
    col(3).trim,
    col(4).trim.toFloat,
    col(5).trim.toFloat,
    col(6).trim.toFloat,
    col(7).trim,
    col(8).trim.toInt)
  )

Then apply the schema to the RDD:

val auctionDF = spark.createDataFrame(dataRdd, schema1)
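
As a quick sanity check (assuming the auctionDF built above), you can print the schema and force evaluation; any mismatch between the Row values and the declared types surfaces at this point:

auctionDF.printSchema()  // should list all nine fields with their declared types
auctionDF.show(5)        // evaluation happens here, so conversion errors appear now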


Source: https://stackoverflow.com/questions/44170145/spark-creating-schema-programmatically-with-different-data-types
