collect() or toPandas() on a large DataFrame in pyspark/EMR

醉梦人生 2020-11-30 12:41

I have an EMR cluster of one machine "c3.8xlarge". After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using p…

3 Answers
  • 2020-11-30 13:19

    TL;DR I believe you're seriously underestimating memory requirements.

    Even assuming that the data is fully cached, the storage info will show only a fraction of the peak memory required to bring the data back to the driver.

    • First of all, Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and compression algorithm, the in-memory size can be much smaller than the uncompressed Pandas output, not to mention a plain List[Row]. The latter also stores column names, further increasing memory usage.
    • Data collection is indirect, with data being stored both on the JVM side and the Python side. While JVM memory can be released once the data goes through the socket, peak memory usage should account for both.
    • The plain toPandas implementation collects Rows first and then creates the Pandas DataFrame locally. This further increases (and possibly doubles) memory usage. Luckily this part is already addressed on master (Spark 2.3), with a more direct approach using Arrow serialization (SPARK-13534 - Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas).

      For a possible solution independent of Apache Arrow, you can check Faster and Lower memory implementation toPandas on the Apache Spark Developer List.
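
    A rough way to see the gap described above (a minimal sketch, not from the original answer; the stand-in DataFrame and sample fraction are assumptions) is to collect only a small sample, measure its Python-side footprint, and extrapolate to the full row count:

    import sys

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the large DataFrame from the question (assumption).
    df = spark.range(0, 10_000_000).selectExpr("id", "id * 2 AS doubled")

    total_rows = df.count()

    # Collect only a small sample and measure its Python-side size.
    sample = df.sample(fraction=0.001, seed=42).collect()
    # sys.getsizeof is shallow (it ignores the objects inside each Row),
    # so this is a lower bound on the real footprint.
    sample_bytes = sum(sys.getsizeof(row) for row in sample)

    # Very rough extrapolation of the driver memory needed for a full collect();
    # the true peak is higher still because data is also buffered on the JVM side.
    estimated_bytes = sample_bytes * total_rows / max(len(sample), 1)
    print(f"~{estimated_bytes / 1024 ** 3:.2f} GiB estimated for the Python objects alone")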

    Since the data is actually pretty large, I would consider writing it to Parquet and reading it back directly in Python using PyArrow (Reading and Writing the Apache Parquet Format), completely skipping all the intermediate stages.
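
    As a sketch of that Parquet round trip (the path and the stand-in DataFrame are illustrative assumptions, not from the answer):

    import pyarrow.parquet as pq
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the large DataFrame from the question (assumption).
    df = spark.range(0, 10_000_000).selectExpr("id", "id % 100 AS bucket")

    # 1. Write the result to Parquet from Spark; the path is illustrative.
    out_path = "/tmp/large_df.parquet"
    df.write.mode("overwrite").parquet(out_path)

    # 2. Read it back directly with PyArrow, bypassing collect()/toPandas().
    table = pq.read_table(out_path)   # pyarrow.Table (reads the whole directory)
    pdf = table.to_pandas()           # pandas.DataFrame, built column by column
    print(pdf.shape)

    This keeps the two sides decoupled: Spark only writes files, and PyArrow reads them without the data ever being materialized as Rows on the JVM driver.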

  • 2020-11-30 13:29

    By using the Arrow setting you will see a speedup:

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    
  • 2020-11-30 13:34

    As mentioned above, when calling toPandas(), all records of the DataFrame are collected to the driver program, so it should only be done on a small subset of the data. (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html)
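
    A concrete way to follow that advice (a sketch; the stand-in DataFrame and the limit/fraction values are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10_000_000)  # stand-in for the real DataFrame

    # Pull only a bounded subset to the driver instead of the full DataFrame.
    pdf_head = df.limit(10_000).toPandas()

    # Or take a random sample when a representative subset is enough.
    pdf_sample = df.sample(fraction=0.001, seed=1).toPandas()

    print(len(pdf_head), len(pdf_sample))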
