I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true) returns a RemoteIterator over every file under the path (the boolean flag makes the listing recursive).
And with Spark...
FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
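For example, a minimal sketch of walking that recursive listing from the SparkContext in the question; the root path hdfs:///data is a hypothetical placeholder for your own directory:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical root directory; substitute your own HDFS path.
val root = new Path("hdfs:///data")
val fs = FileSystem.get(sc.hadoopConfiguration)

// listFiles(path, true) returns a RemoteIterator[LocatedFileStatus] over every file under root.
val it = fs.listFiles(root, true)
while (it.hasNext) {
  val status = it.next()
  println(status.getPath.toString)
}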
Edit
It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme.
path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
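And since the question is about iterating over the directories themselves (listFiles only yields files), here is a sketch using listStatus plus isDirectory; the parent path root is again an assumed placeholder:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical parent directory that holds the subdirectories.
val root = new Path("hdfs:///data")
val fs = root.getFileSystem(sc.hadoopConfiguration)

// listStatus returns the immediate children; keep only the directories.
val subDirs = fs.listStatus(root).filter(_.isDirectory).map(_.getPath)
subDirs.foreach(dir => println(dir.toString))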
A Scala example using FileSystem (Apache Hadoop Main 3.2.1 API):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

val fileSystem: FileSystem = {
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://to_file_path")
  FileSystem.get(conf)
}

// listFiles returns a RemoteIterator; false means do not recurse into subdirectories.
val files = fileSystem.listFiles(new Path(path), false)
val filenames = ListBuffer[String]()
while (files.hasNext) filenames += files.next().getPath.toString
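As a variation, a sketch that collects the same path strings without a mutable buffer, assuming the fileSystem and path values defined above:

// Drain the RemoteIterator lazily and collect the path strings into an immutable List.
val remoteFiles = fileSystem.listFiles(new Path(path), false)
val names: List[String] =
  Iterator.continually(remoteFiles)
    .takeWhile(_.hasNext)
    .map(_.next().getPath.toString)
    .toList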