Pyspark: get list of files/directories on HDFS path

Asked by 野趣味 on 2020-12-05 07:14

As per the title. I'm aware of textFile, but, as the name suggests, it works only on text files. I would need to list the files/directories inside a path on HDFS.

6 Answers
  • 2020-12-05 07:40

    Using the JVM gateway may not be the most elegant approach, but in some cases the code below can be helpful:

    # Grab the Hadoop filesystem classes through the Py4J gateway
    URI           = sc._gateway.jvm.java.net.URI
    Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

    # List the directory and print each entry's full path
    status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

    for fileStatus in status:
        print(fileStatus.getPath())
    
  • 2020-12-05 07:40

    There is an easy way to do this using the snakebite library:

    from snakebite.client import Client

    # HADOOP_HOST and HADOOP_PORT are placeholders for the NameNode host and port
    hadoop_client = Client(HADOOP_HOST, HADOOP_PORT, use_trash=False)

    for x in hadoop_client.ls(['/']):
        print(x)
    
  • 2020-12-05 07:47

    If you want to read in all files in a directory, check out sc.wholeTextFiles [doc], but note that each file's contents are read into the value of a single record, which is probably not the desired result.

    If you want to read only some files, then generating a list of paths (using a normal hdfs ls command plus whatever filtering you need) and passing it into sqlContext.read.text [doc] and then converting from a DataFrame to an RDD seems like the best approach.
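
    A minimal sketch of both options (the paths are made-up placeholders, and an existing SparkContext `sc` plus SQLContext `sqlContext` are assumed):

    # Option 1: whole files, one (path, contents) pair per file
    pairs = sc.wholeTextFiles('hdfs:///some_dir/')

    # Option 2: a pre-filtered list of paths, read line by line, then converted to an RDD
    paths = ['/some_dir/a.txt', '/some_dir/b.txt']   # e.g. the output of `hdfs dfs -ls` plus filtering
    lines = sqlContext.read.text(paths).rdd.map(lambda row: row.value)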

  • 2020-12-05 07:56

    I believe it's helpful to think of Spark only as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
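
    For instance, a Hadoop glob lets a single call read several HDFS paths at once (the path below is a made-up placeholder):

    # Reads every matching part file under the 2020-* directories into one RDD
    logs = sc.textFile('hdfs:///logs/2020-*/part-*')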

    There are a few tools available to do what you want, including esutil and hdfs. The hdfs library supports both a CLI and a Python API, so you can jump straight to 'how do I list HDFS files in Python'. It looks like this:

    from hdfs import Config

    # 'dev' is a cluster alias defined in the HdfsCLI config file (~/.hdfscli.cfg)
    client = Config().get_client('dev')
    files = client.list('the_dir_path')
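
    If the listed names then need to go back into Spark, one possible follow-up (client.list returns bare names, so they are joined back onto the directory first):

    import posixpath

    # Build full HDFS paths and hand them to Spark as a comma-separated list
    full_paths = [posixpath.join('the_dir_path', name) for name in files]
    rdd = sc.textFile(','.join(full_paths))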
    
  • 2020-12-05 08:00

    If you use PySpark, you can run hdfs shell commands interactively:


    List all files in a chosen directory with hdfs dfs -ls <path>, e.g. hdfs dfs -ls /user/path:

    import subprocess

    cmd = 'hdfs dfs -ls /user/path'
    # check_output returns bytes, so decode before splitting into lines
    files = subprocess.check_output(cmd, shell=True).decode().strip().split('\n')
    for path in files:
        print(path)
    

    Or search for files in a chosen directory with hdfs dfs -find <path> -name <expression>, e.g. hdfs dfs -find /user/path -name '*.txt':

    import os
    import subprocess

    source_dir = '/user/path'  # placeholder: the directory to search
    # Quote the glob so the local shell does not expand it before HDFS sees it
    cmd = "hdfs dfs -find {} -name '*.txt'".format(source_dir)
    files = subprocess.check_output(cmd, shell=True).decode().strip().split('\n')
    for path in files:
        filename = path.split(os.path.sep)[-1].split('.txt')[0]
        print(path, filename)
    
  • 2020-12-05 08:01

    This might work for you:

    import re
    import subprocess

    def listdir(path):
        # Parse 'hdfs dfs -ls' output and keep the path column of each line
        out = subprocess.check_output('hdfs dfs -ls ' + path, shell=True).decode()
        return [m.group(1) for m in (re.search(' (/.+)', line) for line in out.split('\n')) if m]

    listdir('/user/')
    

    This also worked:

    # Same idea through the JVM gateway: call Hadoop's FileSystem API directly
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem
    conf = hadoop.conf.Configuration()
    path = hadoop.fs.Path('/user/')
    [str(f.getPath()) for f in fs.get(conf).listStatus(path)]
    