pyspark/dataframe - creating a nested structure


I'm using PySpark with DataFrames and would like to create a nested structure as below.

Before:

Column 1 | Column 2 | Column 3
---------+----------+---------

2 Answers
  •  囚心锁ツ
    2021-01-22 06:46

    First, a reproducible example of your dataframe:

    js = [{"col1": "A", "col2":"B", "col3":1},{"col1": "A", "col2":"B", "col3":2},{"col1": "A", "col2":"C", "col3":1}]
    jsrdd = sc.parallelize(js)
    sqlContext = SQLContext(sc)
    jsdf = sqlContext.read.json(jsrdd)
    jsdf.show()
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   A|   B|   1|
    |   A|   B|   2|
    |   A|   C|   1|
    +----+----+----+
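
    If you are on Spark 2.x or later, where SparkSession is the usual entry point, the same frame can be built without going through an RDD of JSON. A minimal sketch, assuming a SparkSession named `spark` (not part of the original answer):

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()
    jsdf = spark.createDataFrame([
        Row(col1="A", col2="B", col3=1),
        Row(col1="A", col2="B", col3=2),
        Row(col1="A", col2="C", col3=1),
    ])
    jsdf.show()   # produces the same three-row table as above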
    

    Now, lists are not stored as key-value pairs. You can either build a dictionary (map) or simply use collect_list() after a groupBy on col2; the collect_list() version is shown here, and a sketch of the map variant follows below.

    jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
    +----+----+------------------+
    |col1|col2|collect_list(col3)|
    +----+----+------------------+
    |   A|   C|               [1]|
    |   A|   B|            [1, 2]|
    +----+----+------------------+
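
    To push this one level further into an actual nested structure (one row per col1, with each col2 mapping to its list of col3 values), one option is map_from_entries over collected structs. This is only a sketch of the "dictionary" idea, assuming Spark 2.4+ and the `jsdf` / `F` names from above; the column names `col3s` and `col2_map` are just illustrative:

    # Collect col3 per (col1, col2), then fold the (col2, list) pairs into a map per col1.
    nested = (jsdf
              .groupBy("col1", "col2")
              .agg(F.collect_list("col3").alias("col3s"))
              .groupBy("col1")
              .agg(F.map_from_entries(
                  F.collect_list(F.struct("col2", "col3s"))).alias("col2_map")))
    nested.show(truncate=False)
    # one row per col1; col2_map holds e.g. B -> [1, 2], C -> [1]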
    
