python spark alternative to explode for very large data

情歌与酒 2021-01-06 04:17

I have a dataframe like this:

df = spark.createDataFrame(
    [(0, ["B", "C", "D", "E"]), (1, ["E", "A", "C"]), (2, ["F", "A", "E", "B"]),
     (3, ["E", "G", "A"]), (4, ["A", "C", "E", "B", "D"])],
    ["id", "items"])


        
2 Answers
  •  迷失自我
    2021-01-06 04:59

    Observation:

    explode won't change the overall amount of data in your pipeline. The total amount of required space is the same in both the wide (array) and the long (exploded) format. Moreover, the latter distributes better in Spark, which is better suited to long and narrow data than to short and wide data. So

    from pyspark.sql.functions import explode

    df.select(explode("items").alias("item")).groupBy("item").count()

    is the way to go.
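    For reference, calling show() on that aggregate over the sample df above should give counts matching the Counter result further down (row order is not deterministic):

    df.select(explode("items").alias("item")).groupBy("item").count().show()
    # +----+-----+
    # |item|count|
    # +----+-----+
    # |   B|    3|
    # |   C|    3|
    # |   D|    2|
    # |   E|    5|
    # |   A|    4|
    # |   F|    1|
    # |   G|    1|
    # +----+-----+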

    However if you really want to avoid that (for whatever reason) you can use RDD and aggregate.

    from collections import Counter
    
    df.rdd.aggregate(
      Counter(),                                  # zero value
      lambda acc, row: acc + Counter(row.items),  # seqOp: count items within a partition
      lambda acc1, acc2: acc1 + acc2              # combOp: merge per-partition counters
    )
    # Counter({'B': 3, 'C': 3, 'D': 2, 'E': 5, 'A': 4, 'F': 1, 'G': 1}) 
    

    Note that, unlike the DataFrame explode, it stores all the data in memory and is eager.
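
    If you want to avoid both explode and collecting a single Counter on the driver, one possible sketch (assuming the same df with an items array column) keeps the counting distributed with reduceByKey:

    from operator import add

    (df.rdd
       .flatMap(lambda row: row["items"])  # one record per array element, like explode
       .map(lambda item: (item, 1))
       .reduceByKey(add)                   # counts are merged on the executors
       .collect())
    # [('B', 3), ('C', 3), ('D', 2), ('E', 5), ('A', 4), ('F', 1), ('G', 1)] (order may vary)

    This stays distributed until the final collect of the (small) per-item totals, but it is still an RDD job rather than a DataFrame one.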
