Splitting a row in a PySpark DataFrame into multiple rows

难免孤独 2021-02-20 08:20

I currently have a DataFrame where one column holds space-separated strings of the form "a b c d e ...". Call this column "col4".

I would like to split a single row into multiple rows by splitting the elements of col4, keeping the values of all the other columns.

1 Answer
  •  暗喜 (OP)
    2021-02-20 08:47

    Here's a reproducible example:

    # Create dummy data (assumes `sc` is the SparkContext from the pyspark shell)
    df = sc.parallelize([(1, 2, 3, 'a b c'),
                         (4, 5, 6, 'd e f'),
                         (7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3', 'col4'])

    # Split col4 on spaces into an array, then explode it into one row per element
    from pyspark.sql.functions import split, explode
    df.withColumn('col4', explode(split('col4', ' '))).show()
    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |   1|   2|   3|   a|
    |   1|   2|   3|   b|
    |   1|   2|   3|   c|
    |   4|   5|   6|   d|
    |   4|   5|   6|   e|
    |   4|   5|   6|   f|
    |   7|   8|   9|   g|
    |   7|   8|   9|   h|
    |   7|   8|   9|   i|
    +----+----+----+----+
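
To make it clearer what the split/explode combination does to each row, here is a rough plain-Python model of the same transformation. It does not use Spark at all; the helper name `split_and_explode` and its parameters are made up for illustration, and it only approximates the behaviour for simple, non-null string columns.

```python
import re

def split_and_explode(rows, col_index, pattern=" "):
    """Hypothetical plain-Python model of explode(split(col, pattern))."""
    out = []
    for row in rows:
        # Spark's split() treats the pattern as a regular expression,
        # so re.split is used here rather than str.split
        for part in re.split(pattern, row[col_index]):
            # One output row per element; the other columns are copied unchanged
            out.append(row[:col_index] + (part,) + row[col_index + 1:])
    return out

rows = [(1, 2, 3, "a b c"), (4, 5, 6, "d e f")]
result = split_and_explode(rows, 3)
for r in result:
    print(r)
```

Each input row yields as many output rows as there are elements after the split, which is why the three-row example above turns into nine rows.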
    
