How to explode multiple columns of a DataFrame in PySpark

醉梦人生 2020-12-25 08:15

I have a DataFrame whose columns contain lists, similar to the following. The lists in different columns do not all have the same length.

Name  Age  Subjects                 


        
3 answers
  • 2020-12-25 09:00

    PySpark added an arrays_zip function in version 2.4, which removes the need for a Python UDF to zip the arrays. If the arrays differ in length, arrays_zip fills the missing elements with null.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    # Reuse the active session, or start one if none exists
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.createDataFrame(
        [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
        ['Name','Age','Subjects', 'Grades'])
    # Zip the two arrays into one array of structs, then explode it into rows
    df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
           .withColumn("new", F.explode("new"))\
           .select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
    df.show()
    
    +-----+----+---------+------+
    | Name| Age| Subjects|Grades|
    +-----+----+---------+------+
    |[Bob]|[16]|    Maths|     A|
    |[Bob]|[16]|  Physics|     B|
    |[Bob]|[16]|Chemistry|     C|
    +-----+----+---------+------+
    
  • 2020-12-25 09:19

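    Have you tried this?

    from pyspark.sql.functions import col, explode
    df.select(explode(col("Subjects")).alias("Subjects")).show()
    

    If Subjects is stored as a delimited string rather than an array, split it first, for example split(col("Subjects"), ",").

    Alternatively, you can convert the DataFrame to an RDD and use flatMap to turn each row into one record per subject.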
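A minimal sketch of the flatMap idea mentioned above, assuming the column names Name, Age and Subjects from the question. The helper names explode_subjects and explode_with_rdd are hypothetical, and FakeRow is only a stand-in for a Spark Row so that the per-row logic can be tried without a Spark session.

```python
from collections import namedtuple


def explode_subjects(row):
    """Yield one (Name, Age, Subject) tuple for each entry in row.Subjects."""
    for subject in row.Subjects:
        yield (row.Name, row.Age, subject)


def explode_with_rdd(df):
    """Hypothetical wrapper: apply explode_subjects to a PySpark DataFrame."""
    # flatMap flattens the tuples yielded for each row into one RDD of records
    return df.rdd.flatMap(explode_subjects).toDF(["Name", "Age", "Subjects"])


# Stand-in for a Spark Row (assumption: attribute access works the same way)
FakeRow = namedtuple("FakeRow", ["Name", "Age", "Subjects"])
sample = FakeRow(["Bob"], [16], ["Maths", "Physics", "Chemistry"])

records = list(explode_subjects(sample))
print(records)
```

With a real DataFrame, explode_with_rdd(df).show() would be the call to try. Going through an RDD runs the per-row logic in Python, so the built-in explode is usually the simpler choice when it is available.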

  • 2020-12-25 09:21

    This works:

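    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType
    
    # Reuse the active session, or start one if none exists
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.createDataFrame(
        [(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
        ['Name','Age','Subjects', 'Grades'])
    df.show()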
    
    +-----+----+--------------------+---------+
    | Name| Age|            Subjects|   Grades|
    +-----+----+--------------------+---------+
    |[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
    +-----+----+--------------------+---------+
    

    Use a UDF with zip. The columns that need to be exploded have to be merged into a single column before exploding.

    combine = F.udf(lambda x, y: list(zip(x, y)),
                  ArrayType(StructType([StructField("subs", StringType()),
                                        StructField("grades", StringType())])))
    
    df = df.withColumn("new", combine("Subjects", "Grades"))\
           .withColumn("new", F.explode("new"))\
           .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
    df.show()
    
    
    +-----+----+---------+------+
    | Name| Age| Subjects|Grades|
    +-----+----+---------+------+
    |[Bob]|[16]|    Maths|     A|
    |[Bob]|[16]|  Physics|     B|
    |[Bob]|[16]|Chemistry|     C|
    +-----+----+---------+------+
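The question mentions lists of unequal length, and Python's zip stops at the shortest input, so the lambda used in the UDF above would silently drop the trailing subjects in that case. The plain-Python sketch below compares it with itertools.zip_longest, which pads with None instead. The function names combine_truncating and combine_padded are hypothetical, and the sample data with a missing grade is made up for illustration.

```python
from itertools import zip_longest


def combine_truncating(subjects, grades):
    # Same logic as the lambda in the UDF: zip stops at the shorter list
    return list(zip(subjects, grades))


def combine_padded(subjects, grades):
    # zip_longest fills the shorter list with None, so no element is lost
    return list(zip_longest(subjects, grades))


subjects = ["Maths", "Physics", "Chemistry"]
grades = ["A", "B"]  # illustrative data: one grade is missing

truncated = combine_truncating(subjects, grades)
padded = combine_padded(subjects, grades)
print(truncated)  # [('Maths', 'A'), ('Physics', 'B')]
print(padded)     # [('Maths', 'A'), ('Physics', 'B'), ('Chemistry', None)]
```

Under these assumptions, combine_padded could be passed to F.udf in place of the lambda with the same return type, since StructField is nullable by default; the missing grade would then appear as null in the exploded rows.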
    