Spark Scala list folders in directory

北恋 2020-12-05 09:41

I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/

9 Answers
  • 2020-12-05 10:12

    Because you're using Scala, you may also be interested in the following:

    import scala.sys.process._
    val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!
    

    This will, unfortunately, return the entire output of the command as a string, and so parsing down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands where you could do it in the command line easily and are unsure how to do it in Scala, just use the command line through scala.sys.process._. (Use a single ! if you want to just get the return code.)
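
If you do go the command-line route, the parsing step mentioned above might look roughly like this. It is only a sketch: it assumes the usual `hadoop fs -ls` layout (permissions, replication, owner, group, size, date, time, path, preceded by a "Found N items" line), and the names LsParser and directories are made up for illustration.

```scala
// Pure string parsing, no Hadoop dependency, so it can be tried on captured output.
object LsParser {
  // Returns the paths of the directory entries in `hadoop fs -ls` output.
  def directories(lsOutput: String): List[String] =
    lsOutput.split("\n").toList
      .map(_.trim)
      .filter(_.startsWith("d"))   // directory lines start with 'd' in the permission field
      .map(_.split("\\s+", 8))     // limit of 8 keeps spaces inside the path (last field)
      .collect { case fields if fields.length == 8 => fields(7) }
}
```

For example, LsParser.directories(lsResult) would keep only the folder paths from the lsResult string above. Using fs.listStatus is still the more robust option, since it does not depend on the text layout of the command output.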

  • 2020-12-05 10:13

    Azure Blob Storage is mapped to an HDFS location, so all the usual Hadoop operations work on it.

    In the Azure Portal, go to the Storage Account, where you will find the following details:

    • Storage account

    • Key -

    • Container -

    • Path pattern – /users/accountsdata/

    • Date format – yyyy-mm-dd

    • Event serialization format – json

    • Format – line separated

    The path pattern here is the HDFS path. You can log in (for example with PuTTY) to the Hadoop edge node and run:

    hadoop fs -ls /users/accountsdata 
    

    The command above lists all the files. In Scala you can use:

    import scala.sys.process._ 
    
    val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
    
  • 2020-12-05 10:15

    I was looking for the same thing, but for S3 instead of HDFS.

    I solved it by creating the FileSystem with my S3 path, as below:

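    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Returns the paths of all entries directly under `path` (files as well as folders).
    def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
      val hadoopConf = sparkContext.hadoopConfiguration
      val uri = new URI(path)

      // Passing the URI makes FileSystem.get return the file system for that scheme (here S3)
      FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
        _.getPath.toString
      }.toSeq
    }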
    

    I know this question was about HDFS, but others like me may come here looking for an S3 solution. Without the URI, FileSystem.get returns the default file system (HDFS in my case), and listing an S3 path fails with:

    java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
    expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
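
One way to avoid the "Wrong FS" error for any scheme is to let the path resolve its own file system. The sketch below assumes the Hadoop client libraries (and the connector for your scheme) are on the classpath; SubFolders and list are hypothetical names, and unlike the function above it keeps directories only.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object SubFolders {
  // Lists the immediate sub-directories of `dir`, whatever its scheme (hdfs://, s3a://, file:/ ...).
  def list(dir: String, conf: Configuration): List[String] = {
    val path = new Path(dir)
    val fs = path.getFileSystem(conf) // chosen from the path's scheme and authority
    fs.listStatus(path)
      .filter(_.isDirectory)          // drop plain files
      .map(_.getPath.toString)
      .toList
  }
}
```

In a Spark job this could be called as SubFolders.list(path, sparkContext.hadoopConfiguration).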
    
  • 2020-12-05 10:20

    In Spark 2.0+,

    import org.apache.hadoop.fs.{FileSystem, Path}
    val hdfsPath = "/demo/" // your HDFS path here
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path(hdfsPath)).filter(_.isDir).map(_.getPath).foreach(println)
    

    Hope this is helpful.

  • 2020-12-05 10:20

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // `url` is the directory (or glob pattern) to list; `sc` is the SparkContext
    val listStatus = FileSystem.get(new URI(url), sc.hadoopConfiguration)
      .globStatus(new Path(url))

    for (urlStatus <- listStatus) {
      println("urlStatus get Path:" + urlStatus.getPath())
    }

  • 2020-12-05 10:21

    In Ajay Ahuja's answer, isDir is deprecated.

    Use isDirectory instead. A complete example and its output are below.

    package examples

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession

    object ListHDFSDirectories extends App {
      // Quieten Spark's logging
      Logger.getLogger("org").setLevel(Level.WARN)

      val spark = SparkSession.builder()
        .appName(this.getClass.getName)
        .config("spark.master", "local[*]")
        .getOrCreate()

      val hdfspath = "." // your path here
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      fs.listStatus(new Path(hdfspath)).filter(_.isDirectory).map(_.getPath).foreach(println)
    }
    

    Result:

    file:/Users/user/codebase/myproject/target
    file:/Users/user/codebase/myproject/Rel
    file:/Users/user/codebase/myproject/spark-warehouse
    file:/Users/user/codebase/myproject/metastore_db
    file:/Users/user/codebase/myproject/.idea
    file:/Users/user/codebase/myproject/src
    