Use Spark to list all files in a Hadoop HDFS directory?

执念已碎 2021-02-13 13:19

I want to loop through all the text files in a Hadoop directory and count all the occurrences of the word "error". Is there a way to do the equivalent of hadoop fs -ls /users/ubuntu/ to list all the files in the directory from Spark, so I can then loop over them?

2 Answers
  • 2021-02-13 14:03

    You can use a wildcard:

    val errorCount = sc.textFile("hdfs://some-directory/*")
                       .flatMap(_.split(" ")).filter(_ == "error").count
    
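    Note that the filter above only counts whitespace-separated tokens that are exactly "error". If you also want to catch variants such as "Error" or "error:", a case-insensitive substring match is a small tweak (a sketch, assuming that looser matching is acceptable for your logs):

    // count every whitespace-separated token that contains "error", ignoring case
    val looseErrorCount = sc.textFile("hdfs://some-directory/*")
                            .flatMap(_.split(" "))
                            .filter(_.toLowerCase.contains("error"))
                            .count
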
  • 2021-02-13 14:04
    If you need to recurse into subdirectories rather than read a single directory, you can walk the tree with the Hadoop FileSystem API and collect every file path:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.{ListBuffer, Stack}

    // Hadoop filesystem handle built from Spark's Hadoop configuration
    val fs = FileSystem.get(sc.hadoopConfiguration)
    // directories still to visit, and files found so far
    val dirs = Stack[String]()
    val files = ListBuffer.empty[String]

    dirs.push("/user/username/")

    // iterative walk: pop a directory, queue its subdirectories, record its files
    while (dirs.nonEmpty) {
        val status = fs.listStatus(new Path(dirs.pop()))
        status.foreach(x => if (x.isDirectory) dirs.push(x.getPath.toString)
                            else files += x.getPath.toString)
    }
    files.foreach(println)
    
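    To tie this back to the original question, the collected paths can then be fed into Spark and the word count run over all of them at once. A minimal sketch, assuming the files buffer from the snippet above and that every collected path is plain text:

    // build one RDD per collected file, union them, then count exact "error" tokens
    val errorCount = sc.union(files.map(f => sc.textFile(f)))
                       .flatMap(_.split(" "))
                       .filter(_ == "error")
                       .count
    println(s"error occurrences: $errorCount")
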