Spark java.lang.OutOfMemoryError: Java heap space

半阙折子戏 2020-11-22 13:55

My cluster: 1 master, 11 slaves, each node has 6 GB memory.

My settings:

spark.executor.memory=4g, Dspark.akka.frameSize=512

12 Answers
  • 2020-11-22 14:16

    You should configure offHeap memory settings as shown below:

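    import org.apache.spark.sql.SparkSession

    // Memory sizes below are examples; size them to the RAM actually available.
    val spark = SparkSession
         .builder()
         .master("local[*]")
         .config("spark.executor.memory", "70g")
         .config("spark.driver.memory", "50g")
         .config("spark.memory.offHeap.enabled", true)   // turn on off-heap allocation
         .config("spark.memory.offHeap.size", "16g")     // off-heap budget; must be > 0 when enabled
         .appName("sampleCodeForReference")
         .getOrCreate()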
    

    Set the driver memory and executor memory according to the RAM available on your machines. You can increase the offHeap size if you still hit the OutOfMemory error.

  • 2020-11-22 14:16

    As I understand the code provided in the question, it loads the file, applies a map operation, and saves the result. No operation requires a shuffle, and no operation brings data back to the driver, so tuning anything related to shuffle or the driver may have no impact. The driver does have issues when there are too many tasks, but this was only the case up to Spark 2.0.2. Two things may be going wrong:

    • There are only one or a few executors. Increase the number of executors so that they can be allocated to different slaves. On YARN, change the num-executors setting (--num-executors); on Spark standalone, tune the number of cores per executor (spark.executor.cores) and the maximum cores setting (spark.cores.max). In standalone mode, number of executors = max cores / cores per executor.
    • There are very few partitions, maybe only one. With too few partitions, multiple cores and multiple executors will not help much, because parallelism depends on the number of partitions. Increase the partition count, for example with imageBundleRDD.repartition(11).
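
    The standalone formula in the first bullet can be sketched as plain arithmetic. This is only an illustration: ExecutorMath and standaloneExecutors are hypothetical names, not part of Spark, and the core counts are assumed example values rather than figures from the question.

```scala
// Hypothetical helper, not part of Spark: applies the rule
// "num executors = max cores / cores per executor" from the answer above.
object ExecutorMath {
  def standaloneExecutors(maxCores: Int, coresPerExecutor: Int): Int = {
    require(coresPerExecutor > 0, "coresPerExecutor must be positive")
    maxCores / coresPerExecutor // integer division, any remainder is dropped
  }

  def main(args: Array[String]): Unit = {
    // Assumed example: 44 cores in total, 4 cores per executor.
    val executors = standaloneExecutors(maxCores = 44, coresPerExecutor = 4)
    println(s"executors = $executors")
  }
}
```

    With one executor per slave as the goal, the same number is a reasonable lower bound for the partition count passed to repartition.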
  • 2020-11-22 14:21

    You should increase the driver memory. In your $SPARK_HOME/conf folder you should find the file spark-defaults.conf; edit it and set spark.driver.memory to something like 4000m, depending on the memory available on your master. This is what fixed the issue for me, and everything now runs smoothly.

  • 2020-11-22 14:21

    The place to set the memory heap size (at least in spark-1.0.0) is conf/spark-env.sh. The relevant variables are SPARK_EXECUTOR_MEMORY and SPARK_DRIVER_MEMORY. More docs are in the deployment guide.

    Also, don't forget to copy the configuration file to all the slave nodes.

  • 2020-11-22 14:27

    I have a few suggestions:

    • If your nodes are configured to allow Spark a maximum of 6g (leaving a little for other processes), then use 6g rather than 4g: spark.executor.memory=6g. Check the UI to make sure you are using as much memory as possible (it shows how much memory you are using).
    • Try using more partitions; you should have 2 - 4 per CPU. IME increasing the number of partitions is often the easiest way to make a program more stable (and often faster). For huge amounts of data you may need far more than 4 per CPU; I've had to use 8000 partitions in some cases!
    • Decrease the fraction of memory reserved for caching, using spark.storage.memoryFraction. If you don't use cache() or persist in your code, this might as well be 0. Its default is 0.6, which means you only get 0.4 * 4g of memory for your heap. IME reducing the memory fraction often makes OOMs go away. UPDATE: From Spark 1.6, apparently, we no longer need to play with these values; Spark will determine them automatically.
    • The same applies to the shuffle memory fraction (spark.shuffle.memoryFraction). If your job doesn't need much shuffle memory, set it to a lower value (this might cause your shuffles to spill to disk, which can have a catastrophic impact on speed). Sometimes, when it is a shuffle operation that is OOMing, you need to do the opposite, i.e. set it to something large, like 0.8, or make sure you allow your shuffles to spill to disk (the default since 1.0.0).
    • Watch out for memory leaks; these are often caused by accidentally closing over objects you don't need in your lambdas. To diagnose, look for "task serialized as XXX bytes" in the logs: if XXX is larger than a few k or more than an MB, you may have a memory leak. See https://stackoverflow.com/a/25270600/1586965
    • Related to above; use broadcast variables if you really do need large objects.
    • If you are caching large RDDs and can sacrifice some access time, consider serialising the RDD (see http://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage), or even caching them on disk (which sometimes isn't that bad if using SSDs).
    • (Advanced) Related to above, avoid String and heavily nested structures (like Map and nested case classes). If possible, try to use only primitive types and index all non-primitives, especially if you expect a lot of duplicates. Choose WrappedArray over nested structures whenever possible. Or even roll out your own serialisation - YOU will have the most information regarding how to efficiently pack your data into bytes, USE IT!
    • (bit hacky) Again when caching, consider using a Dataset to cache your structure as it will use more efficient serialisation. This should be regarded as a hack when compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache-space by 100x or 1000x, whereas all a Dataset will likely give is 2x - 5x in memory and 10x compressed (parquet) on disk.
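
    The arithmetic behind the memory-fraction and partition bullets above can be sketched as follows. This is a simplified illustration that follows the "0.4 * 4g" reasoning literally and ignores the shuffle fraction and any safety margins; MemoryMath and its methods are hypothetical names, and the core count in the example is an assumption.

```scala
// Hypothetical helpers, not part of Spark.
object MemoryMath {
  // Heap left over once the storage (cache) fraction is reserved,
  // e.g. 4g with the old default fraction of 0.6 leaves about 1.6g.
  def nonStorageHeapGb(executorMemoryGb: Double, storageFraction: Double): Double = {
    require(storageFraction >= 0.0 && storageFraction <= 1.0, "fraction must be in [0, 1]")
    executorMemoryGb * (1.0 - storageFraction)
  }

  // Lower and upper partition counts from the "2 - 4 per CPU" rule of thumb.
  def suggestedPartitions(totalCores: Int): (Int, Int) =
    (2 * totalCores, 4 * totalCores)

  def main(args: Array[String]): Unit = {
    println(f"non-storage heap: ${nonStorageHeapGb(4.0, 0.6)}%.1f GB")
    // Assumed example: 44 cores across the cluster.
    println(s"suggested partitions: ${suggestedPartitions(44)}")
  }
}
```

    Lowering the storage fraction to 0 in this model hands the whole 4g back to the heap, which is why the bullet suggests it for jobs that never cache.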

    http://spark.apache.org/docs/1.2.1/configuration.html

    EDIT: (So I can google myself easier) The following is also indicative of this problem:

    java.lang.OutOfMemoryError: GC overhead limit exceeded
    
  • 2020-11-22 14:27

    I suffered from this issue a lot when using dynamic resource allocation. I had thought it would utilize my cluster resources to best fit the application.

    But the truth is that dynamic resource allocation doesn't set the driver memory; it leaves it at its default value, which is 1G.

    I resolved this issue by setting spark.driver.memory to a number that suits my driver's memory (for 32GB ram I set it to 18G).

    You can set it with the spark-submit command as follows:

    spark-submit --conf spark.driver.memory=18g
    

    A very important note: this property will not be taken into consideration if you set it from code, according to the Spark documentation (Dynamically Loading Spark Properties):

    Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
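
    One way to check whether the driver memory setting actually took effect is to print the JVM's maximum heap from inside the driver process. The sketch below uses only the standard Runtime API; HeapCheck is a hypothetical name, and the reported value may be somewhat lower than the configured figure.

```scala
// Hypothetical helper: reports the maximum heap of the JVM it runs in.
// Call it from driver code to see what the driver was really given.
object HeapCheck {
  // Maximum amount of memory the JVM will attempt to use, in MiB.
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"JVM max heap: $maxHeapMiB MiB")
}
```

    If the printed value stays near the 1G default after setting spark.driver.memory in code, that is consistent with the quoted documentation, and the value should be passed through spark-submit or the configuration file instead.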
