Out of Memory Error when Reading large file in Spark 2.1.0

左心房为你撑大大i posted on 2019-12-05 12:00:46

I was getting this error when running spark-shell, so I increased the driver memory. After that I was able to load the XML.

spark-shell --driver-memory 6G

Source: https://github.com/lintool/warcbase/issues/246#issuecomment-249272263
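To confirm that the larger driver heap actually took effect, one option is to ask the JVM for its maximum heap from inside spark-shell. This is a minimal sketch, not part of the original answer: `HeapCheck` and `maxHeapMiB` are hypothetical names, and the reported figure can be somewhat below the requested size depending on the garbage collector in use.

```scala
// Hypothetical helper: reports the maximum heap the current JVM may use.
object HeapCheck {
  // Runtime.maxMemory() returns bytes; convert to MiB for readability.
  def maxHeapMiB(): Long = Runtime.getRuntime.maxMemory() / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max heap available to this JVM: ${maxHeapMiB()} MiB")
}
```

Pasting the object into spark-shell and calling `HeapCheck.maxHeapMiB()` should show a value in the neighbourhood of the `--driver-memory` setting when the shell runs the driver locally.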

The error occurs because you are storing your RDD twice. Change your logic along the following lines, or do the filtering with Spark SQL:

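    // Imports assumed by this snippet (not shown in the original post)
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.rand
    import SparkFactory.spark.implicits._ // encoders needed by .as[Post] and .map

    val df: DataFrame = SparkFactory.spark.read
      .format("com.databricks.spark.xml") // assumption: spark-xml, implied by rowTag
      .option("mode", "DROPMALFORMED")
      .schema(customSchema) // defined previously
      .option("rowTag", "row")
      .load(inputPath) // assumption: inputPath is a placeholder; the load call is missing from the post

    println(s"\n\nNUM PARTITIONS: ${df.rdd.getNumPartitions}\n\n")
    // prints 1604

    // regexes to clean the text
    val tagPat = "<[^>]+>".r
    val angularBracketsPat = "><|>|<"
    val whitespacePat = """\s+""".r

    // filter and select only the cols i'm interested in
    df
      .where(df.col("_TypeId") === "1")
      // NOTE: a .select(...) mapping columns to Post's fields appears to be
      // missing from the post at this point; its column names are not recoverable
      .as[Post]
      .map {
        case Post(id, title, body, tags) =>

          val body1 = tagPat.replaceAllIn(body, "")
          val body2 = whitespacePat.replaceAllIn(body1, " ")

          Post(id, title.toLowerCase, body2.toLowerCase, tags.split(angularBracketsPat).mkString(","))
      }
      .orderBy(rand(SEED)) // random sort
      .write // write it back to disk
      .option("quoteAll", true)
      .csv(outputPath) // assumption: outputPath is a placeholder; quoteAll is a CSV writer option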
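The per-row cleaning step can be tried without a Spark cluster. The following is a minimal sketch in plain Scala that reuses the three patterns from the snippet above. The `Post` field types (all `String`) and the `PostCleaner` object are assumptions, since the original case class is not shown. One deliberate difference: splitting a value such as `<scala><apache-spark>` on the bracket pattern leaves an empty first element, so this sketch filters empty strings out before joining.

```scala
// Assumed shape of the Post case class; the original definition is not shown.
final case class Post(id: String, title: String, body: String, tags: String)

// Hypothetical helper wrapping the cleaning logic of the map step.
object PostCleaner {
  private val tagPat = "<[^>]+>".r            // any HTML/XML tag
  private val angularBracketsPat = "><|>|<"   // delimiters between tag names
  private val whitespacePat = """\s+""".r     // runs of whitespace

  def clean(post: Post): Post = {
    // Strip markup, then collapse whitespace runs to a single space.
    val noTags = tagPat.replaceAllIn(post.body, "")
    val collapsed = whitespacePat.replaceAllIn(noTags, " ")
    // Split "<a><b>" into tag names, dropping the empty leading element.
    val tagList = post.tags.split(angularBracketsPat).filter(_.nonEmpty).mkString(",")
    Post(post.id, post.title.toLowerCase, collapsed.toLowerCase, tagList)
  }
}
```

Because `clean` is a pure function, it could be passed to the Dataset's `map` as `.map(PostCleaner.clean _)` and unit-tested separately from the Spark job.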

You can change the heap size by setting the following environment variable:

  1. Environment Variable name : _JAVA_OPTIONS
  2. Environment Variable Value : -Xmx512M -Xms512M
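To see which maximum heap size the variable requests, the value can be read back and parsed. This is only a sketch: `JavaOptions` and `parseXmxMiB` are hypothetical names, and the parser handles just the common `-Xmx<number>[k|m|g]` form rather than every spelling the JVM accepts.

```scala
// Hypothetical helper: extracts the -Xmx setting from a JVM options string.
object JavaOptions {
  private val XmxPattern = """-Xmx(\d+)([kKmMgG]?)""".r

  // Returns the requested maximum heap in MiB, or None if -Xmx is absent.
  def parseXmxMiB(options: String): Option[Long] =
    XmxPattern.findFirstMatchIn(options).map { m =>
      val amount = m.group(1).toLong
      m.group(2).toLowerCase match {
        case "g" => amount * 1024L
        case "m" => amount
        case "k" => amount / 1024L
        case _   => amount / (1024L * 1024L) // no suffix means bytes
      }
    }

  def main(args: Array[String]): Unit = {
    // Read the variable named in the answer and report what it asks for.
    val configured = sys.env.get("_JAVA_OPTIONS").flatMap(parseXmxMiB)
    val text = configured.map(v => v.toString + " MiB").getOrElse("not set")
    println("-Xmx from _JAVA_OPTIONS: " + text)
  }
}
```

With the value from the list above, `JavaOptions.parseXmxMiB("-Xmx512M -Xms512M")` yields `Some(512)`, which can then be compared against the amount of data the job needs to hold in memory.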