Spark out of memory

面向向阳花 2021-02-04 12:41

I have a folder with 150 GB of txt files (around 700 files, each about 200 MB on average).

I'm using Scala to process the files and calculate some aggregate statistics in the

3 answers
  •  慢半拍i (original poster)
    2021-02-04 12:54

    My team and I successfully processed more than 1 TB of CSV data on 5 machines with 32 GB of RAM each. Whether it works depends heavily on what kind of processing you're doing and how you do it.

    1. Repartitioning an RDD requires additional computation, and its overhead can exceed your heap size. Instead, try loading the files with more parallelism from the start by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat).

    2. Try using mapPartitions instead of map so you can handle the computation one partition at a time. If the computation uses a temporary variable or instance and you still run out of memory, try reducing the amount of data per partition (by increasing the number of partitions).

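    3. Increase the driver and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the SparkContext. (In client mode the driver JVM has already started by then, so driver memory has to be passed with --driver-memory on spark-submit or set in the default properties file instead.)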
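
    The three suggestions above could be combined roughly as in the sketch below. It is only an illustration under assumptions: the object name AggregateStats, the input path, the memory sizes, and the 32 MB split cap are all made up, and since the question does not say which statistics are needed, the helper just counts lines and characters as a stand-in. The split-size constants are declared on Hadoop's FileInputFormat (TextInputFormat inherits them), and the code reads through the new Hadoop API because that is the code path that honors the maximum split size.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object AggregateStats {
  // Folds one whole partition into a single (lineCount, charCount) pair,
  // so only two counters are kept in memory per partition.
  // Hypothetical stand-in for the real statistics.
  def summarize(lines: Iterator[String]): Iterator[(Long, Long)] = {
    var lineCount = 0L
    var charCount = 0L
    lines.foreach { line =>
      lineCount += 1
      charCount += line.length
    }
    Iterator.single((lineCount, charCount))
  }

  def main(args: Array[String]): Unit = {
    // Suggestion 3: memory limits, set before the SparkContext exists.
    // The sizes are assumptions; tune them to the machines you have.
    val conf = new SparkConf()
      .setAppName("aggregate-stats")
      .set("spark.executor.memory", "8g")
      // In client mode, pass --driver-memory to spark-submit instead,
      // because the driver JVM is already running at this point.
      .set("spark.driver.memory", "4g")
    val sc = new SparkContext(conf)

    // Suggestion 1: smaller input splits mean more, smaller partitions.
    // 32 MB is an assumed cap, not a recommendation.
    sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MAXSIZE, 32L * 1024 * 1024)

    val lines = sc.newAPIHadoopFile(
      "/path/to/txt/folder", // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      sc.hadoopConfiguration
    ).map { case (_, text) => text.toString } // copy: Hadoop reuses Text objects

    // Suggestion 2: aggregate per partition, then merge the small results.
    val (totalLines, totalChars) = lines
      .mapPartitions(summarize)
      .reduce { case ((l1, c1), (l2, c2)) => (l1 + l2, c1 + c2) }

    println(s"lines=$totalLines chars=$totalChars")
    sc.stop()
  }
}
```

    Keeping summarize as a plain function over an Iterator means it can be checked without a cluster, and the only data that travels back to the driver is one small tuple per partition.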

    Note that Spark is a general-purpose cluster computing system, so (IMHO) it's inefficient to run Spark on a single machine.
