How to explode columns?

無奈伤痛 2020-12-25 14:42

After:

    val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

I have this DataFrame in Apache Spark:

    +----+---------+
    |Col1|     Col2|
    +----+---------+
    |   1|[2, 3, 4]|
    |   1|[2, 3, 4]|
    +----+---------+

How do I convert it into:

    +----+----+----+----+
    |Col1|Col2|Col3|Col4|
    +----+----+----+----+
    |   1|   2|   3|   4|
    |   1|   2|   3|   4|
    +----+----+----+----+
4 Answers
  • 2020-12-25 15:08

    You can use a map:

    import scala.collection.mutable
    import org.apache.spark.sql.Row
    import spark.implicits._  // already in scope in spark-shell; needed for toDF and the tuple encoder

    df.map {
        case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
    }.toDF("Col1", "Col2", "Col3", "Col4").show()
    
  • 2020-12-25 15:16

    Just to add on to sgvd's solution:

    If the size is not always the same, you can set nElements like this:

    import org.apache.spark.sql.functions.{max, size}
    import spark.implicits._  // for the 'Col2 symbol-to-column syntax

    val nElements = df.select(size('Col2).as("Col2_count"))
                      .select(max("Col2_count"))
                      .first.getInt(0)
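
    Because indexing past the end of an array column returns null under Spark's default (non-ANSI) behavior, this max-based nElements can be fed straight into sgvd's select below. A minimal sketch, assuming the df and nElements defined above:

    val flattened = df.select(
      ($"Col1" +: (0 until nElements).map(idx => $"Col2"(idx) as s"Col${idx + 2}")): _*
    )
    flattened.show()  // rows with shorter arrays get null in the trailing columns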
    
  • 2020-12-25 15:23

    A solution that doesn't convert to and from RDD:

    df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")
    

    Or arguably nicer:

    val nElements = 3
    df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))
    

    The size of a Spark array column is not fixed; you could, for instance, have:

    +----+------------+
    |Col1|        Col2|
    +----+------------+
    |   1|   [2, 3, 4]|
    |   1|[2, 3, 4, 5]|
    +----+------------+
    

    So there is no general way to know the number of columns up front and create them. If you know the size is always the same, you can set nElements by looking at the first row:

    val nElements = df.select("Col2").first.getList(0).size
    
  • 2020-12-25 15:31

    Just to give the PySpark version of sgvd's answer: if the array column is Col2, then this select statement moves the first nElements of each array in Col2 into their own columns:

    from pyspark.sql import functions as F

    # assumes nElements is already defined, e.g. nElements = 3
    df.select([F.col('Col2').getItem(i) for i in range(nElements)])
    