Spark iterate HDFS directory

耶瑟儿~ 2020-12-01 01:48

I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?

8 Answers
  • 2020-12-01 02:31

    You can try globStatus as well:

    import java.net.URI

    // url points at the HDFS directory (or glob pattern) to list
    val listStatus = org.apache.hadoop.fs.FileSystem
      .get(new URI(url), sc.hadoopConfiguration)
      .globStatus(new org.apache.hadoop.fs.Path(url))

    for (urlStatus <- listStatus) {
      println("urlStatus get Path:" + urlStatus.getPath())
    }
    
  • 2020-12-01 02:32

    Here's a PySpark version, if someone is interested:

        hadoop = sc._jvm.org.apache.hadoop
    
        fs = hadoop.fs.FileSystem
        conf = hadoop.conf.Configuration() 
        path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')
    
        for f in fs.get(conf).listStatus(path):
            print(f.getPath(), f.getLen())
    

    In this particular case I get a list of all files that make up the disc_mrt.unified_fact Hive table.

    Other methods of the FileStatus object, like getLen() to get the file size, are described here:

    Class FileStatus
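
    Since the question is about a directory of directories, here is a minimal sketch (my own addition, not part of the answer above) that descends into sub-directories by checking isDirectory() on each FileStatus; the path is just the same example path reused:

        # Sketch: recursive walk over an HDFS directory tree via the JVM gateway
        hadoop = sc._jvm.org.apache.hadoop
        fs = hadoop.fs.FileSystem.get(hadoop.conf.Configuration())

        def walk(p):
            for status in fs.listStatus(p):
                if status.isDirectory():
                    walk(status.getPath())          # descend into the sub-directory
                else:
                    print(status.getPath(), status.getLen())

        walk(hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/'))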

  • 2020-12-01 02:35

    This did the job for me:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    FileSystem
      .get(new URI("hdfs://HAservice:9000"), sc.hadoopConfiguration)
      .listStatus(new Path("/tmp/"))
      .foreach(x => println(x.getPath))
    
  • 2020-12-01 02:38
    import org.apache.hadoop.fs.{FileSystem, Path}

    FileSystem.get(sc.hadoopConfiguration)
      .listStatus(new Path("hdfs:///tmp"))
      .foreach(x => println(x.getPath))
    

    This worked for me.

    Spark version 1.5.0-cdh5.5.2

  • 2020-12-01 02:38

    @Tagar didn't show how to connect to a remote HDFS, but this does:

    URI           = sc._gateway.jvm.java.net.URI
    Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    
    
    fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
    
    status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
    
    for fileStatus in status:
        print(fileStatus.getPath())
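
    A small variation (my assumption, not part of the answer above): if the cluster's HDFS settings are already on Spark's classpath, the SparkContext's Hadoop configuration can be passed instead of an empty Configuration(), which should also pick up HA nameservice settings. This reuses the gateway imports from the snippet above:

    # Sketch: reuse the SparkContext's Hadoop configuration (host and path are hypothetical)
    fs = FileSystem.get(URI("hdfs://somehost:8020"), sc._jsc.hadoopConfiguration())

    for fileStatus in fs.listStatus(Path('/some_dir/yet_another_one_dir/')):
        print(fileStatus.getPath())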
    
  • 2020-12-01 02:40

    I had some issues with the other answers (like 'JavaObject' object is not iterable), but this code works for me:

    # listFiles returns a RemoteIterator, hence the explicit hasNext()/next() loop
    fs = spark_contex._jvm.org.apache.hadoop.fs.FileSystem.get(spark_contex._jsc.hadoopConfiguration())
    i = fs.listFiles(spark_contex._jvm.org.apache.hadoop.fs.Path(path), False)
    while i.hasNext():
        f = i.next()
        print(f.getPath())
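
    A hedged follow-up (not in the original answer): listFiles takes a recursive flag as its second argument, so passing True instead of False should walk nested directories and yield every file beneath path:

    # Sketch: recursive listing; True makes listFiles descend into sub-directories
    it = fs.listFiles(spark_contex._jvm.org.apache.hadoop.fs.Path(path), True)
    while it.hasNext():
        print(it.next().getPath())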
    