How can one list all csv files in an HDFS location within the Spark Scala shell?

温柔的废话 2021-01-05 06:27

The purpose of this is to manipulate and save a copy of each data file in a second HDFS location. I will be using

RddName.coalesce(1).saveAsTextFile(...)


        
3 Answers
  • 2021-01-05 06:54

    sc.wholeTextFiles(path) should help. It gives an RDD of (filepath, filecontent) pairs.
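
    For instance, pointing it at a glob pattern picks out just the CSV files (a minimal sketch; the source path below is a placeholder):

    // read every .csv under the source directory as (path, content) pairs
    val csvFiles = sc.wholeTextFiles("hdfs:///<source_location>/*.csv")

    // print the matching file paths
    csvFiles.keys.collect().foreach(println)

    Keep in mind that wholeTextFiles materialises each file's full content as a single record, so it suits many small files better than a few very large ones.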

  • 2021-01-05 06:57

    I haven't tested it thoroughly but something like this seems to work:

    import org.apache.spark.deploy.SparkHadoopUtil
    import org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator}
    import java.net.URI

    val path: String = ???

    // build a Hadoop Configuration from the Spark conf and get a handle on HDFS
    val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
    val hdfs = FileSystem.get(hconf)

    // non-recursive listing of the files under `path`
    val iter = hdfs.listFiles(new Path(path), false)

    // drain the RemoteIterator into a List of file URIs
    def listFiles(iter: RemoteIterator[LocatedFileStatus]) = {
      def go(iter: RemoteIterator[LocatedFileStatus], acc: List[URI]): List[URI] = {
        if (iter.hasNext) {
          val uri = iter.next.getPath.toUri
          go(iter, uri :: acc)
        } else {
          acc
        }
      }
      go(iter, List.empty[java.net.URI])
    }

    // keep only the .csv files
    listFiles(iter).filter(_.toString.endsWith(".csv"))
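
    The returned paths could then be copied with the Hadoop FileSystem API, for example via FileUtil.copy. A minimal sketch, reusing hdfs, hconf, path and listFiles from above and taking a fresh listing (a RemoteIterator can only be traversed once); the destination directory is hypothetical:

    import org.apache.hadoop.fs.FileUtil

    // hypothetical destination directory for the copies
    val destDir = new Path("/<target_location>")

    // fresh listing, filtered to .csv, then copy each file (keeping the source)
    val csvUris = listFiles(hdfs.listFiles(new Path(path), false))
      .filter(_.toString.endsWith(".csv"))

    csvUris.foreach { uri =>
      val src = new Path(uri)
      FileUtil.copy(hdfs, src, hdfs, new Path(destDir, src.getName), false, hconf)
    }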
    
  • 2021-01-05 07:02

    This is what ultimately worked for me:

    import org.apache.hadoop.fs._
    import org.apache.spark.deploy.SparkHadoopUtil
    import java.net.URI
    
    val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
    val hdfs = FileSystem.get(hdfs_conf)
    // source data in HDFS
    val sourcePath = new Path("/<source_location>/<filename_pattern>")
    
    hdfs.globStatus( sourcePath ).foreach{ fileStatus =>
       val filePathName = fileStatus.getPath().toString()
       val fileName = fileStatus.getPath().getName()
    
       // < DO STUFF HERE>
    
    } // end foreach loop
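
    As an illustration of the < DO STUFF HERE> step, each matched file could be re-read and written to a second location with the coalesce/saveAsTextFile approach mentioned in the question. A sketch only; the target path is a placeholder and sc is the shell's SparkContext:

    hdfs.globStatus(sourcePath).foreach { fileStatus =>
      val filePathName = fileStatus.getPath().toString()
      val fileName = fileStatus.getPath().getName()

      // one possible body: copy each CSV to a second HDFS location
      sc.textFile(filePathName)
        .coalesce(1)
        .saveAsTextFile("/<target_location>/" + fileName)
    }

    Note that each saveAsTextFile call writes a directory (containing a single part-00000 file) named after the source file, not a plain file.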
    