python spark alternative to explode for very large data

情歌与酒 2021-01-06 04:17

I have a dataframe like this:

df = spark.createDataFrame([(0, ["B","C","D","E"]), (1, ["E","A","C"]), (2, ["F","A","E","B"]), (3, ["E","G ...


        
2 Answers
  • 2021-01-06 04:59

    What you need to do is reduce the size of the partitions going into the explode. There are two ways to do this: if your input data is splittable, you can decrease spark.sql.files.maxPartitionBytes so Spark reads smaller splits; otherwise, repartition before the explode.

    The default value of maxPartitionBytes is 128MB, so Spark will attempt to read your data in 128MB chunks. If the data is not splittable, it will read the full file into a single partition, in which case you'll need to repartition instead.
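
    As a rough sketch of both options (the concrete values are placeholders, not recommendations; tune them against your own data sizes and the Spark UI):

    # Option 1: read smaller splits (only helps if the source format is splittable);
    # this must be set before the data is read.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)  # 16MB instead of the 128MB default

    # Option 2: explicitly repartition before the explode (works for non-splittable input too).
    df = df.repartition(1000)  # placeholder count; aim for small partitions going into the explode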

    In your case, since you're doing an explode: say it's a 100x increase, then with 128MB per partition going in, you end up with 12GB+ per partition coming out!

    The other thing you may need to consider is your shuffle partitions, since you're doing an aggregation. You may need to increase the parallelism of the aggregation after the explode by setting spark.sql.shuffle.partitions to something higher than the default of 200. Use the Spark UI to look at your shuffle stage, see how much data each task is reading, and adjust accordingly.
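
    For example (2000 is purely illustrative, and the items column name is taken from the question):

    from pyspark.sql.functions import explode

    # Raise the parallelism of the shuffle behind the groupBy that follows the explode.
    spark.conf.set("spark.sql.shuffle.partitions", 2000)

    counts = df.select(explode("items").alias("item")).groupBy("item").count()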

    I discuss this and other tuning suggestions in the talk I just gave at Spark Summit Europe.

  • 2021-01-06 04:59

    Observation:

    explode won't change the overall amount of data in your pipeline. The total amount of required space is the same in both the wide (array) and the long (exploded) format. Moreover, the latter distributes better in Spark, which is better suited to long and narrow data than to short and wide data. So

    from pyspark.sql.functions import explode

    df.select(explode("items").alias("item")).groupBy("item").count()
    

    is the way to go.
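
    For reference, a runnable sketch on a small made-up stand-in for the (truncated) dataframe from the question; only the shape (an id plus an array column named items) is assumed:

    from pyspark.sql.functions import explode

    # Rows here are invented for illustration; the real dataframe is much larger.
    df_small = spark.createDataFrame(
        [(0, ["B", "C", "D", "E"]), (1, ["E", "A", "C"]), (2, ["F", "A", "E", "B"])],
        ["id", "items"]
    )

    df_small.select(explode("items").alias("item")).groupBy("item").count().show()
    # One row per distinct item with its frequency, e.g. E -> 3, A -> 2, B -> 2, ...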

    However, if you really want to avoid that (for whatever reason), you can use the RDD API with aggregate:

    from collections import Counter
    
    df.rdd.aggregate(
      Counter(),                                   # zero value
      lambda acc, row: acc + Counter(row.items),   # count items within each partition
      lambda acc1, acc2: acc1 + acc2               # merge the per-partition counters
    )
    # Counter({'B': 3, 'C': 3, 'D': 2, 'E': 5, 'A': 4, 'F': 1, 'G': 1}) 
    

    Note that, unlike the DataFrame explode, it stores all the data in memory and is eager.
