Spark Job error GC overhead limit exceeded [duplicate]

Question

I am running a Spark job with the following configuration in spark-defaults.sh. I made these changes on the name node. I have one data node, and I am working with 2 GB of data.

spark.master                     spark://master:7077
spark.executor.memory            5g
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

But I am getting a "GC overhead limit exceeded" error.

Here is the code I am working on.

import os
import sys
import unicodedata
from operator import add 

try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print("Error importing Spark modules: {}".format(e))
    sys.exit(1)


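# Delimiter function: reads the two delimiter characters from fixed
# positions in the record and returns them as (eD, sD).
def findDelimiter(text):
    sD = text[1]
    eD = text[2]
    return (eD, sD)

# Emits one (senderID, 1) pair per input record.
# NOTE: with the indices as posted, sD is text[1], so text.split(sD)[0] is at
# most one character and seg[6] raises IndexError; the real delimiter
# positions presumably differ from this simplified version.
def tokenize(text):
    eD, sD = findDelimiter(text)
    seg = text.split(sD)[0].split(eD)
    senderID = seg[6].strip()
    yield (senderID, 1)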


conf = SparkConf()
sc = SparkContext(conf=conf)

textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")

rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(lambda a,b: a+b)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")

I also tried groupByKey instead of reduceByKey, but I get the same error. When I remove the reduceByKey or groupByKey step, I do get output. Can someone help me with this error?
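
For reference, the difference between the two aggregations can be emulated locally in plain Python. This is only a sketch of the semantics, not Spark's implementation; the helper names reduce_by_key and group_by_key and the sample sender IDs are made up for illustration.

```python
from collections import defaultdict
from operator import add


def reduce_by_key(pairs, func):
    """Combine values per key as they arrive, keeping one running value per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


def group_by_key(pairs):
    """Collect every value per key into a list before anything is combined."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Hypothetical (senderID, 1) pairs, shaped like the output of tokenize above.
pairs = [("S1", 1), ("S2", 1), ("S1", 1), ("S1", 1)]

counts = reduce_by_key(pairs, add)
grouped = group_by_key(pairs)
print(counts)
print(grouped)
```

Both forms give the same counts once the grouped lists are summed. The grouped form holds every value for a key in memory first, which is why reduceByKey is usually the lighter choice for a simple count.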

Should I also increase the GC size in Hadoop? As I said earlier, I set driver.memory to 5 GB on the name node. Should I do the same on the data node as well?

Answer 1:

Try adding the settings below to your spark-defaults.sh:

spark.driver.extraJavaOptions -XX:+UseG1GC

spark.executor.extraJavaOptions -XX:+UseG1GC

Tuning JVM garbage collection can be tricky, but G1GC seems to work pretty well. Worth trying!
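
One thing to watch: the configuration in the question already sets spark.executor.extraJavaOptions, so the G1 flag probably needs to go into that same value rather than onto a second line with the same key. Below is a minimal Python sketch that builds the combined line. The helper name merge_java_options is hypothetical, and it only does string handling; it does not talk to Spark.

```python
def merge_java_options(existing, extra_flag):
    """Return `existing` with `extra_flag` appended unless it is already there."""
    if extra_flag in existing.split():
        return existing
    return (existing + " " + extra_flag).strip()


# Value of spark.executor.extraJavaOptions as posted in the question.
existing = '-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"'

merged = merge_java_options(existing, "-XX:+UseG1GC")
line = "spark.executor.extraJavaOptions  " + merged
print(line)
```

Calling the helper again with the same flag leaves the value unchanged, so it is safe to apply more than once.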

Answer 2:

The code you have should have worked with your configuration. As suggested earlier, try using G1GC. Also try reducing the storage memory fraction. By default it is 60%; try reducing it to 40% or less. You can set it by adding spark.storage.memoryFraction 0.4 to your configuration.
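
As a rough back-of-the-envelope check of what that change means for the 5g executor in the question, here is a small Python sketch. It simply multiplies the heap size by the fraction and ignores Spark's additional safety margins, so the numbers are approximations; the function name storage_budget_mb is made up. Note also that spark.storage.memoryFraction belongs to the legacy memory manager: from Spark 1.6 on it is only read when spark.memory.useLegacyMode is enabled.

```python
def storage_budget_mb(executor_memory_mb, storage_percent):
    """Approximate heap (in MB) set aside for cached data at a given storage fraction."""
    return executor_memory_mb * storage_percent // 100


executor_memory_mb = 5 * 1024  # spark.executor.memory 5g from the question

default_budget = storage_budget_mb(executor_memory_mb, 60)  # fraction 0.6
reduced_budget = storage_budget_mb(executor_memory_mb, 40)  # fraction 0.4

print(default_budget, reduced_budget)
print("freed for other use:", default_budget - reduced_budget, "MB")
```

Since the job in the question does not cache any RDDs, lowering the storage share would, under these assumptions, leave more of the heap for the aggregation itself.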

Answer 3:

I was able to solve the problem. I was running Hadoop as the root user on the master node, but I had configured Hadoop under a different user on the data nodes. After I configured it under the root user on the data node as well and increased the executor and driver memory, it worked fine.

Source: https://stackoverflow.com/questions/37958522/spark-job-error-gc-overhead-limit-exceeded

Tags

Hadoop

apache-spark

garbage-collection

out-of-memory