Too many open files in spark aborting spark job

Posted by Deadly on 2019-12-25 00:18:52

Question


In my application I read 40 GB of text data spread across 188 files. I split these files and create one XML file per line in Spark using a pair RDD. For 40 GB of input this produces many millions of small XML files, which is my requirement. Everything works fine, but when Spark saves the files to S3 it throws an error and the job fails.

Here is the exception I get:

Caused by: java.nio.file.FileSystemException: /mnt/s3/emrfs-2408623010549537848/0000000000: Too many open files
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel(Files.java:361)
    at java.nio.file.Files.createFile(Files.java:632)
    at com.amazon.ws.emr.hadoop.fs.files.TemporaryFiles.create(TemporaryFiles.java:70)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.openNewPart(MultipartUploadOutputStream.java:493)
    ... 21 more

ApplicationMaster host: 10.97.57.198
ApplicationMaster RPC port: 0
queue: default
start time: 1542344243252
final status: FAILED
tracking URL: http://ip-10-97-57-234.tr-fr-nonprod.aws-int.thomsonreuters.com:20888/proxy/application_1542343091900_0001/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1542343091900_0001 finished with failed status

And this one as well:

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: D33581CA9A799F64; S3 Extended Request ID: /SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=), S3 Extended Request ID: /SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=

Here is my code that does this:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TestAudit {

  def main(args: Array[String]): Unit = {

    val inputPath = args(0)
    val output = args(1)
    val noOfHashPartitioner = args(2).toInt

    //val conf = new SparkConf().setAppName("AuditXML").setMaster("local")
    val conf = new SparkConf().setAppName("AuditXML")
    val sc = new SparkContext(conf)

    val input = sc.textFile(inputPath)

    // Each input line is "<fileName>|<fileContent>"; turn it into a (fileName, fileContent) pair.
    val pairedRDD = input.map { row =>
      val split = row.split("\\|")
      val fileName = split(0)
      val fileContent = split(1)
      (fileName, fileContent)
    }

    // Output format that writes each value into a file named after its key
    // and drops the key itself from the output.
    class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    // Hash-partition by file name and write one gzipped file per key to S3.
    pairedRDD
      .partitionBy(new HashPartitioner(10000))
      .saveAsHadoopFile(
        "s3://a205381-tr-fr-development-us-east-1-trf-auditabilty//AUDITOUTPUT",
        classOf[String], classOf[String],
        classOf[RddMultiTextOutputFormat], classOf[GzipCodec])

  }

}

I even tried reducing the number of hash partitions, but that did not work either.


Answer 1:


Every process on a Unix system is limited in the number of files (file descriptors) it can hold open at once. Because your data is large and Spark internally splits each partition into many sub-files, your process hits that limit and fails with the error above. You can raise the file-descriptor limit for every user as follows:

Edit /etc/security/limits.conf and add (or modify) these lines:

*         hard    nofile      500000
*         soft    nofile      500000
root      hard    nofile      500000
root      soft    nofile      500000

This sets the nofile limit (the maximum number of open file descriptors) to 500000 for every user, including root.

The changes are applied after a restart (or a fresh login session).
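
A quick sanity check from a new login shell, assuming bash: ulimit -Sn prints the soft limit and ulimit -Hn the hard limit, so both should now report the raised values.

ulimit -Sn   # soft limit for open files
ulimit -Hn   # hard limit for open files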

You can also set the file-descriptor limit for a specific process via the systemd LimitNOFILE directive. For example, if you run your Spark jobs on YARN and the YARN daemons are started by systemd, you can add LimitNOFILE=128000 to the systemd units of the YARN services (ResourceManager and NodeManager) to raise their limit to 128000 file descriptors; a sketch follows.
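
A minimal sketch of such a systemd drop-in, assuming the NodeManager unit is named hadoop-yarn-nodemanager (the exact unit name depends on your distribution and Hadoop packaging):

# /etc/systemd/system/hadoop-yarn-nodemanager.service.d/limits.conf  (assumed unit name)
[Service]
LimitNOFILE=128000

Then reload systemd and restart the service so the new limit takes effect:

sudo systemctl daemon-reload
sudo systemctl restart hadoop-yarn-nodemanager   # assumed unit name

Repeat the same drop-in for the ResourceManager unit if needed.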

Related articles:

  • 3 Methods to Change the Number of Open File Limit in Linux
  • Limits on the number of file descriptors


Source: https://stackoverflow.com/questions/53331833/too-many-open-files-in-spark-aborting-spark-job
