I have a folder with 150 GB of txt files (around 700 files, averaging 200 MB each).
I'm using Scala to process the files and calculate some aggregate statistics.
My team and I have successfully processed CSV data of over 1 TB on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.
Repartitioning an RDD requires additional computation that has overhead above your heap size. Instead, try loading the file with more parallelism by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to raise the level of parallelism.
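For example, in PySpark the two constants correspond to these Hadoop configuration keys. The 32 MB value and the input path are illustrative only, and `sc._jsc` is an internal handle that may change between versions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="more-splits")

# Hadoop keys behind TextInputFormat.SPLIT_MINSIZE / SPLIT_MAXSIZE.
# 32 MB is an illustrative value, not a recommendation.
split = str(32 * 1024 * 1024)
hconf = sc._jsc.hadoopConfiguration()  # internal handle; API may change
hconf.set("mapreduce.input.fileinputformat.split.minsize", split)
hconf.set("mapreduce.input.fileinputformat.split.maxsize", split)

lines = sc.textFile("/data/txt")  # hypothetical input directory
```

With a ~200 MB file and 32 MB splits, each file is read as several smaller partitions instead of one or two large ones.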
Try using mapPartitions instead of map so the computation is handled once per partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (by increasing the number of partitions).
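To illustrate the per-partition idea without a cluster, here is a minimal pure-Python sketch. The function plays the role of the one you would pass to `rdd.mapPartitions`, receiving a whole partition as an iterator, and the nested lists stand in for partitions (`aggregate_partition` and the sample data are made up for illustration):

```python
def aggregate_partition(lines):
    # One accumulator per partition, not one object per record.
    count = 0
    total = 0
    for line in lines:
        count += 1
        total += len(line)
    # mapPartitions expects an iterator of results, hence yield.
    yield (count, total)

# In Spark this would be rdd.mapPartitions(aggregate_partition);
# here we simulate two partitions with plain lists.
partitions = [["ab", "cde"], ["f"]]
results = [r for part in partitions for r in aggregate_partition(iter(part))]
print(results)  # [(2, 5), (1, 1)]
```

The point is that per-partition state (buffers, connections, accumulators) is created once per partition instead of once per record, which keeps allocation pressure down.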
Increase the driver and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the SparkContext.
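A minimal configuration sketch, assuming PySpark; the "8g"/"4g" sizes are illustrative and should be tuned to your hardware:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("aggregate-stats")
        .set("spark.executor.memory", "8g")
        .set("spark.driver.memory", "4g"))

# Note: in client mode the driver JVM is already running by this point,
# so "spark.driver.memory" is usually passed to spark-submit instead.
sc = SparkContext(conf=conf)
```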
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
To add another perspective based on code (as opposed to configuration): sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was that I was collecting all the results back in the master rather than letting the tasks save the output.
E.g., instead of collecting everything back to the driver:

    for item in processed_data.collect():
        print(item)

let each task write its own part of the output:

    processed_data.saveAsTextFile(output_dir)
Yes, the PySpark RDD/DataFrame collect() function retrieves all the elements of the dataset (from all nodes) to the driver node. You should only use collect() on a small dataset, usually after filter(), group(), count(), etc. Retrieving a large dataset this way results in out-of-memory errors.
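A minimal sketch of the safe pattern, using a plain Python list comprehension in place of an RDD so it runs without a cluster; in real Spark the filter step runs on the executors and only the small result reaches the driver:

```python
# Stand-in for a large distributed dataset.
dataset = range(1_000_000)

# Filter first (distributed in real Spark), so only a handful of
# elements survive...
small = [x for x in dataset if x % 250_000 == 0]

# ...then "collect" the small result on the driver.
collected = list(small)
print(collected)  # [0, 250000, 500000, 750000]
```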