How to split a multi-value column into separate rows using a typed Dataset?

無奈伤痛 2021-01-05 07:17

I am facing an issue of how to split a multi-value column, i.e. List[String], into separate rows.

The initial dataset has the following type: Dataset

3 answers
  • 2021-01-05 07:32

    explode is often suggested, but it comes from the untyped DataFrame API. Since you are working with a typed Dataset, the flatMap operator is probably a better fit (see org.apache.spark.sql.Dataset).

    flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
    

    (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    You could use it as follows:

    val ds = Seq(
      (0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3")))
      .toDF("id", "text", "value", "properties")
      .as[(Integer, String, Double, scala.List[String])]
    
    scala> ds.flatMap { t => 
      t._4.map { prp => 
        (t._1, t._2, t._3, prp) }}.show
    +---+-----------------+---+----+
    | _1|               _2| _3|  _4|
    +---+-----------------+---+----+
    |  0|Lorem ipsum dolor|1.0|prp1|
    |  0|Lorem ipsum dolor|1.0|prp2|
    |  0|Lorem ipsum dolor|1.0|prp3|
    +---+-----------------+---+----+
    
    // or just using for-comprehension
    for {
      t <- ds
      prp <- t._4
    } yield (t._1, t._2, t._3, prp)
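
    The same idea may read more clearly with case classes instead of tuples, since the result then keeps named fields. The following is only a sketch: Record, FlatRecord, FlatMapSketch and splitProperties are hypothetical names chosen for illustration, and a local SparkSession is assumed.

    ```scala
    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    // Hypothetical case classes; field names mirror the columns used above.
    case class Record(id: Int, text: String, value: Double, properties: Seq[String])
    case class FlatRecord(id: Int, text: String, value: Double, property: String)

    object FlatMapSketch {
      // Emits one FlatRecord per element of `properties`,
      // repeating the remaining fields on every output row.
      def splitProperties(ds: Dataset[Record])(implicit enc: Encoder[FlatRecord]): Dataset[FlatRecord] =
        ds.flatMap(r => r.properties.map(p => FlatRecord(r.id, r.text, r.value, p)))

      def main(args: Array[String]): Unit = {
        // Assumption: a local session is good enough for trying this out.
        val spark = SparkSession.builder().master("local[*]").appName("flatMap-sketch").getOrCreate()
        import spark.implicits._ // supplies the encoders for the case classes

        val ds = Seq(Record(0, "Lorem ipsum dolor", 1.0, Seq("prp1", "prp2", "prp3"))).toDS()
        splitProperties(ds).show()

        spark.stop()
      }
    }
    ```

    Note that a row whose properties list is empty produces no output rows at all with this approach.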
    
  • 2021-01-05 07:48

    Here's one way to do it:

    val myRDD = sc.parallelize(Array(
      (0, "text0", 1.0, List("prp1", "prp2", "prp3")),
      (1, "text1", 2.0, List("prp4", "prp5", "prp6")),
      (2, "text2", 3.0, List("prp7", "prp8", "prp9"))
    )).map{
      case (i, t, v, ps) => ((i, t, v), ps)
    }.flatMapValues(x => x).map{
      case ((i, t, v), p) => (i, t, v, p)
    }
    
  • 2021-01-05 07:50

    You can use explode:

    import org.apache.spark.sql.functions.explode  // explode lives in the functions object

    df.withColumn("property", explode($"property"))
    

    Example:

    val df = Seq((1, List("a", "b"))).toDF("A", "B")   
    // df: org.apache.spark.sql.DataFrame = [A: int, B: array<string>]
    
    df.withColumn("B", explode($"B")).show
    +---+---+
    |  A|  B|
    +---+---+
    |  1|  a|
    |  1|  b|
    +---+---+
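
    Since the question asks about a typed Dataset, it may be worth noting that withColumn returns an untyped DataFrame, so a type has to be re-applied afterwards with as[...]. A minimal sketch under that assumption (ExplodeTypedSketch and the tuple types are illustrative only, and a local SparkSession is assumed):

    ```scala
    import org.apache.spark.sql.{Dataset, SparkSession}
    import org.apache.spark.sql.functions.explode

    object ExplodeTypedSketch {
      def main(args: Array[String]): Unit = {
        // Assumption: a local session is good enough for trying this out.
        val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
        import spark.implicits._

        // A typed Dataset of tuples; its columns are named _1 and _2.
        val ds: Dataset[(Int, Seq[String])] = Seq((1, Seq("a", "b"))).toDS()

        // explode replaces the array column with one row per element;
        // .as[...] then turns the resulting DataFrame back into a typed Dataset.
        val exploded: Dataset[(Int, String)] =
          ds.withColumn("_2", explode($"_2")).as[(Int, String)]

        exploded.show()
        spark.stop()
      }
    }
    ```

    If rows with an empty or null array must be kept, explode_outer from the same functions object could be used instead of explode.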
    