Spark out of memory

面向向阳花 2021-02-04 12:41

I have a folder with 150 GB of txt files (around 700 files, each about 200 MB on average).

I'm using Scala to process the files and calculate some aggregate statistics in the end.

3 Answers
  •  隐瞒了意图╮
    2021-02-04 13:13

    Yes. The Spark RDD/DataFrame collect() action retrieves all elements of the dataset from every node to the driver. collect() should only be used on small datasets, typically after operations such as filter(), group(), or count() that have already reduced the data. Collecting a large dataset on the driver causes an out-of-memory error.
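
    To keep the computation distributed, compute the statistics with Spark's aggregation functions and only bring the small aggregated result to the driver. Below is a minimal Scala sketch of that pattern; the input path and the particular statistics (line count, average/max line length) are illustrative assumptions, not the original poster's job.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object AggregateStats {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("aggregate-stats")
          .getOrCreate()
        import spark.implicits._

        // Each line becomes one row; the 150 GB stays distributed across executors.
        // The path is hypothetical.
        val lines = spark.read.textFile("/data/txt-files/*.txt")

        // Aggregations run on the cluster; only a single summary row reaches the driver.
        val stats = lines
          .select(length($"value").as("lineLength"))
          .agg(
            count("*").as("lines"),
            avg($"lineLength").as("avgLineLength"),
            max($"lineLength").as("maxLineLength")
          )

        // Safe to show/collect here: the result is one row, not the full dataset.
        stats.show()

        spark.stop()
      }
    }
    ```

    The key point is that collect() (or show() on an unaggregated DataFrame) is only safe once the data has been reduced to something that fits in driver memory.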
