Out of Memory Error when Reading large file in Spark 2.1.0


I was getting this error when running spark-shell, so I increased the driver memory to a larger value. After that I was able to load the XML.

spark-shell --driver-memory 6G

Source: https://github.com/lintool/warcbase/issues/246#issuecomment-249272263
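
Once the shell has restarted with the larger setting, you can verify that it was actually applied. A minimal check, assuming the spark session that spark-shell (Spark 2.x) provides:

    // Inside spark-shell: print the driver memory that was applied.
    // The key is only present if it was set explicitly (e.g. via --driver-memory).
    println(spark.conf.get("spark.driver.memory")) // expected to print 6G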

This happens because you are storing your RDD twice. Your logic must be changed like this, or you can filter with Spark SQL instead:

    // imports used by the snippet (added for completeness)
    import org.apache.spark.sql.{DataFrame, SaveMode}
    import org.apache.spark.sql.functions.rand
    import SparkFactory.spark.implicits._ // encoder for the Post case class

    val df: DataFrame = SparkFactory.spark.read
      .option("mode", "DROPMALFORMED")
      .format("com.databricks.spark.xml")
      .schema(customSchema) // defined previously
      .option("rowTag", "row")
      .load(s"$pathToInputXML")
      .coalesce(numPartitions)

    println(s"\n\nNUM PARTITIONS: ${df.rdd.getNumPartitions}\n\n")
    // prints 1604

    // regexes to clean the text
    val tagPat = "<[^>]+>".r
    val angularBracketsPat = "><|>|<"
    val whitespacePat = """\s+""".r

    // filter and select only the columns I'm interested in
    df
      .where(df.col("_TypeId") === "1")
      .select(
        df("_Id").as("id"),
        df("_Title").as("title"),
        df("_Body").as("body"),
        df("_Tags").as("tags")
      ).as[Post]
      .map {
        case Post(id, title, body, tags) =>
          // strip markup tags, then collapse runs of whitespace
          val body1 = tagPat.replaceAllIn(body, "")
          val body2 = whitespacePat.replaceAllIn(body1, " ")

          Post(id, title.toLowerCase, body2.toLowerCase, tags.split(angularBracketsPat).mkString(","))
      }
      .orderBy(rand(SEED)) // random sort
      .write // write it back to disk
      .option("quoteAll", true)
      .mode(SaveMode.Overwrite)
      .csv(output)
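
For completeness, the snippet above relies on a few definitions that are not shown in the answer (the Post case class, the SparkFactory session holder, the schema and the path/partition values). A rough sketch of what they might look like, with illustrative names and types that are assumptions rather than part of the original code:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StructField, StringType, StructType}

    // Case class matching the four columns selected into the Dataset.
    case class Post(id: String, title: String, body: String, tags: String)

    // Simple holder for a shared SparkSession; the real SparkFactory may differ.
    object SparkFactory {
      val spark: SparkSession = SparkSession.builder()
        .appName("posts-xml-cleanup") // hypothetical application name
        .getOrCreate()
    }

    // Illustrative placeholders for the values referenced above.
    val customSchema: StructType = StructType(Seq(
      StructField("_Id", StringType),
      StructField("_TypeId", StringType),
      StructField("_Title", StringType),
      StructField("_Body", StringType),
      StructField("_Tags", StringType)
    ))
    val pathToInputXML = "/path/to/input.xml" // hypothetical path
    val numPartitions = 1604
    val SEED = 42L                            // any fixed seed
    val output = "/path/to/output"            // hypothetical path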

You can change the heap size by setting the following environment variable:

  1. Environment variable name: _JAVA_OPTIONS
  2. Environment variable value: -Xmx512M -Xms512M
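
To confirm that the new limit is actually picked up by the JVM, a quick generic check (not Spark-specific) is to print the maximum heap the runtime reports, which should roughly match the -Xmx value:

    // Prints the maximum heap available to this JVM, in megabytes.
    object HeapCheck {
      def main(args: Array[String]): Unit = {
        val maxMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
        println(s"Max heap: $maxMb MB")
      }
    }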