Spark out of memory

面向向阳花 2021-02-04 12:41

I have a folder with 150 GB of txt files (around 700 files, each about 200 MB on average).

I'm using Scala to process the files and calculate some aggregate statistics in the end.

3 Answers
  •  隐瞒了意图╮
    2021-02-04 13:13

    Yes. The Spark RDD/DataFrame collect() action retrieves all elements of the dataset from every node to the driver. collect() should only be used on small datasets, typically after operations such as filter(), group(), or count() that have already reduced the data. Collecting a large dataset on the driver causes an out-of-memory error.
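
    To keep the computation distributed, compute the statistics with Spark's aggregation functions and only bring the small aggregated result to the driver. Below is a minimal Scala sketch of that pattern; the input path and the particular statistics (line count, average/max line length) are illustrative assumptions, not the original poster's job.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object AggregateStats {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("aggregate-stats")
          .getOrCreate()
        import spark.implicits._

        // Each line becomes one row; the 150 GB stays distributed across executors.
        // The path is hypothetical.
        val lines = spark.read.textFile("/data/txt-files/*.txt")

        // Aggregations run on the cluster; only a single summary row reaches the driver.
        val stats = lines
          .select(length($"value").as("lineLength"))
          .agg(
            count("*").as("lines"),
            avg($"lineLength").as("avgLineLength"),
            max($"lineLength").as("maxLineLength")
          )

        // Safe to show/collect here: the result is one row, not the full dataset.
        stats.show()

        spark.stop()
      }
    }
    ```

    The key point is that collect() (or show() on an unaggregated DataFrame) is only safe once the data has been reduced to something that fits in driver memory.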
