Spark - How to combine/merge elements in a DataFrame which are in Seq[Row] to generate a Row

轮回少年 2021-01-16 11:40

I want to start by saying that I am forced to use Spark 1.6.

I am generating a DataFrame from a JSON file like this:

{\"id\" : \"1201\", \"nam         


        
1 Answer
  •  执笔经年
    2021-01-16 12:21

    The output from the map is of type (String, Row), so it cannot be encoded using RowEncoder alone. You have to provide a matching tuple encoder:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{Encoder, Encoders, Row}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    
    // Encoder for the (String, Row) pairs produced by the map below
    val encoder = Encoders.tuple(
      Encoders.STRING,  // the age key
      RowEncoder(       // the merged row
        // The same as df.schema in your case
        StructType(Seq(
          StructField("age", StringType),
          StructField("id", StringType),
          StructField("name", StringType)))))
    
    filterd.map { row => (
      row.getAs[String]("age"),
      PrintOne(row.getAs[Seq[Row]](0), row.getAs[String]("age")))
    }(encoder)
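
    Since the question snippet above is cut off, PrintOne itself is not shown. A minimal hypothetical stand-in, assuming it merges the nested Seq[Row] into one flat Row matching the encoder's (age, id, name) schema, could look like this:

    // Hypothetical stand-in for the question's PrintOne (the original is
    // not visible above); the field choices here are assumptions.
    def PrintOne(rows: Seq[Row], age: String): Row = Row(
      age,                            // age passed in from the outer row
      rows.head.getAs[String]("id"),  // id from the first nested row
      rows.last.getAs[String]("name") // name from the last nested row
    )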
    

    Overall this approach looks like an anti-pattern. If you want to use a more functional style, you should avoid Dataset[Row]:

    case class Person(age: String, id: String, name: String)
    
    // Work with typed Person objects instead of untyped Rows
    filterd.as[(Seq[Person], String)].map {
      case (people, age) => (age, (age, people(0).id, people(1).name))
    }
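
    Compared with the encoder version, this keeps the nested data strongly typed: field access such as people(0).id is checked at compile time instead of failing at runtime inside getAs.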
    

    Alternatively, you can use a udf, as sketched below.
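
    A rough sketch of that udf route, assuming the array column is called "people" and the top-level column "age" (both names are guesses, since the question's schema is cut off):

    import org.apache.spark.sql.functions.{col, udf}
    
    // Merge the nested rows and the age into a single struct column.
    // The column names "people" and "age" are assumptions.
    val mergePeople = udf((people: Seq[Row], age: String) =>
      (age, people(0).getAs[String]("id"), people(1).getAs[String]("name")))
    
    filterd.select(mergePeople(col("people"), col("age")).as("merged"))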

    Also please note that the o.a.s.sql.catalyst package, including GenericRowWithSchema, is intended mostly for internal use. Unless it is strictly necessary, prefer o.a.s.sql.Row.
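
    For example, a row matching the schema above can be built with the public factory (the values here are just samples):

    import org.apache.spark.sql.Row
    
    // Public API: preferred over constructing GenericRowWithSchema directly
    val merged: Row = Row("25", "1201", "some name")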
