I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true) returns a RemoteIterator over every file under the path (the boolean flag makes the listing recursive).
And with Spark...
FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
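For example, a minimal sketch of walking that recursive listing from the SparkContext in the question; the root path hdfs:///data is a hypothetical placeholder for your own directory:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical root directory; substitute your own HDFS path.
val root = new Path("hdfs:///data")
val fs = FileSystem.get(sc.hadoopConfiguration)

// listFiles(path, true) returns a RemoteIterator[LocatedFileStatus] over every file under root.
val it = fs.listFiles(root, true)
while (it.hasNext) {
  val status = it.next()
  println(status.getPath.toString)
}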
Edit
It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme.
path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
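And since the question is about iterating over the directories themselves (listFiles only yields files), here is a sketch using listStatus plus isDirectory; the parent path root is again an assumed placeholder:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical parent directory that holds the subdirectories.
val root = new Path("hdfs:///data")
val fs = root.getFileSystem(sc.hadoopConfiguration)

// listStatus returns the immediate children; keep only the directories.
val subDirs = fs.listStatus(root).filter(_.isDirectory).map(_.getPath)
subDirs.foreach(dir => println(dir.toString))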
A Scala example using FileSystem (Apache Hadoop Main 3.2.1 API):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

val fileSystem: FileSystem = {
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://to_file_path")
  FileSystem.get(conf)
}

// listFiles returns a RemoteIterator; false means do not recurse into subdirectories.
val files = fileSystem.listFiles(new Path(path), false)
val filenames = ListBuffer[String]()
while (files.hasNext) filenames += files.next().getPath.toString
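As a variation, a sketch that collects the same path strings without a mutable buffer, assuming the fileSystem and path values defined above:

// Drain the RemoteIterator lazily and collect the path strings into an immutable List.
val remoteFiles = fileSystem.listFiles(new Path(path), false)
val names: List[String] =
  Iterator.continually(remoteFiles)
    .takeWhile(_.hasNext)
    .map(_.next().getPath.toString)
    .toList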