Spark iterate HDFS directory

耶瑟儿~ 2020-12-01 01:48

I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?

8 answers
  • 2020-12-01 02:44

    You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true), which gives you a RemoteIterator over all the files under that path.

    And with Spark...

    FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
    

    Edit

    It's worth noting that it is good practice to get the FileSystem associated with the Path's scheme, rather than the default FileSystem:

    path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
    
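    A minimal, self-contained sketch of this approach, assuming an existing SparkContext named sc; the root path hdfs:///data below is only a hypothetical example:

    import org.apache.hadoop.fs.Path

    // Recursively list every file under a root directory.
    // Assumes an existing SparkContext `sc`; the path is a hypothetical example.
    val root = new Path("hdfs:///data")
    val fs = root.getFileSystem(sc.hadoopConfiguration)
    val it = fs.listFiles(root, true)   // true = descend into subdirectories
    while (it.hasNext) {
      val status = it.next()            // a LocatedFileStatus for one file
      println(status.getPath.toString)
    }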
  • 2020-12-01 02:44

    In Scala, using the Hadoop FileSystem API (Apache Hadoop Main 3.2.1 API):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    import scala.collection.mutable.ListBuffer

    val fileSystem: FileSystem = {
        val conf = new Configuration()
        conf.set("fs.defaultFS", "hdfs://to_file_path")
        FileSystem.get(conf)
    }

    // `path` is the directory to list; `false` means do not recurse into subdirectories.
    val files = fileSystem.listFiles(new Path(path), false)
    val filenames = ListBuffer[String]()
    while (files.hasNext) filenames += files.next().getPath.toString
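
    If the goal is to feed the collected files back into Spark, the path strings can be joined into a comma-separated list, which sc.textFile accepts. A small sketch, assuming a SparkContext named sc and the filenames buffer built above:

    // Read all of the collected files into a single RDD; Spark's textFile
    // accepts a comma-separated list of paths (assumes `sc` and `filenames`).
    val rdd = sc.textFile(filenames.mkString(","))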
    