Out of memory error when collecting data out of Spark cluster

臣服心动 2021-02-05 11:25

I know there are plenty of questions on SO about out of memory errors on Spark, but I haven't found a solution to mine.

I have a simple workflow:

  1. read in O
2 Answers
  •  慢半拍i 2021-02-05 11:35

    As mentioned above, "cache" is not an action; see the RDD Persistence docs:

    You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. 
    

    But "collect" is an action, and all computations (including "cache") will be started when "collect" is called.

    You run the application in standalone mode, which means the initial data loading and all of the computations are performed in the same memory.

    It is the data loading and the other computations that use most of the memory, not "collect".

    You can check this by replacing "collect" with "count".
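
    For example, with the same "rdd" as in the sketch above, "count" runs the identical computation on the executors but returns only a single number to the driver:

        // count() is also an action, so the load, map and cache work still happens,
        // but only one Long travels back to the driver.
        val n = rdd.count()

    If the job still fails with "count", the memory pressure comes from the computation itself, not from collecting the results into the driver.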
