How to list all files in a directory and its subdirectories in Hadoop HDFS

Asked 2020-12-01 05:50

I have a folder in HDFS which has two subfolders, each of which has about 30 subfolders that, finally, each contain xml files. I want to list all the xml files giving only the main folder's path.

9 Answers
  • 2020-12-01 06:02

    A code snippet covering both the recursive (whole tree) and non-recursive (top level only) approaches:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    //helper method to get the list of files from the HDFS path
    public static List<String>
        listFilesFromHDFSPath(Configuration hadoopConfiguration,
                              String hdfsPath,
                              boolean recursive) throws IOException,
                                            IllegalArgumentException
    {
        //resulting list of files
        List<String> filePaths = new ArrayList<String>();
    
        //get path from string and then the filesystem
        Path path = new Path(hdfsPath);  //throws IllegalArgumentException
        FileSystem fs = path.getFileSystem(hadoopConfiguration);
    
        //if recursive approach is requested
        if(recursive)
        {
            //deep recursion risks a stack overflow => use an explicit queue
            Queue<Path> fileQueue = new LinkedList<Path>();
    
            //add the obtained path to the queue
            fileQueue.add(path);
    
            //while the fileQueue is not empty
            while (!fileQueue.isEmpty())
            {
                //get the file path from queue
                Path filePath = fileQueue.remove();
    
                //filePath refers to a file
                if (fs.isFile(filePath))
                {
                    filePaths.add(filePath.toString());
                }
                else   //else filePath refers to a directory
                {
                    //list paths in the directory and add to the queue
                    FileStatus[] fileStatuses = fs.listStatus(filePath);
                    for (FileStatus fileStatus : fileStatuses)
                    {
                        fileQueue.add(fileStatus.getPath());
                    } // for
                } // else
    
            } // while
    
        } // if
        else        //non-recursive approach => only the top level is listed
        {
            //if the given hdfsPath is actually a directory
            if(fs.isDirectory(path))
            {
                FileStatus[] fileStatuses = fs.listStatus(path);
    
                //loop over all file statuses
                for(FileStatus fileStatus : fileStatuses)
                {
                    //if the given status is a file, then update the resulting list
                    if(fileStatus.isFile())
                        filePaths.add(fileStatus.getPath().toString());
                } // for
            } // if
            else        //it is a file then
            {
                //return the one and only file path in the resulting list
                filePaths.add(path.toString());
            } // else
    
        } // else
    
        //close the filesystem; no more operations will be performed
        fs.close();
    
        //return the resulting list
        return filePaths;
    } // listFilesFromHDFSPath
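
    A minimal usage sketch (the namenode URI below is a placeholder, adapt it to your cluster):

    Configuration conf = new Configuration();
    List<String> paths = listFilesFromHDFSPath(conf, "hdfs://namenode:9000/user/test/in", true);
    for (String p : paths)
        System.out.println(p);

    One caveat: the fs.close() call inside the helper closes the FileSystem instance that the Hadoop client caches per JVM, which can break other code in the same process that shares that cached instance; call the helper once and reuse its result.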
    
  • 2020-12-01 06:06

    Don't use a recursive approach (it risks stack overflow on deep trees) :) use a queue:

    queue.add(param_dir)
    while (queue is not empty) {
        directory = queue.pop()
        - get the items in the current directory
        - if an item is a file, add it to the result list
        - if an item is a directory => queue.push(item)
    }
    

    that was easy, enjoy!

    0 讨论(0)
  • 2020-12-01 06:09

    Thanks Radu Adrian Moldovan for the suggestion.

    Here is an implementation using queue:

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    private static List<String> listAllFilePath(Path hdfsFilePath, FileSystem fs)
            throws FileNotFoundException, IOException {
      List<String> filePathList = new ArrayList<String>();
      Queue<Path> fileQueue = new LinkedList<Path>();
      fileQueue.add(hdfsFilePath);
      while (!fileQueue.isEmpty()) {
        Path filePath = fileQueue.remove();
        if (fs.isFile(filePath)) {
          //a plain file: record its full path
          filePathList.add(filePath.toString());
        } else {
          //a directory: enqueue its children for later processing
          FileStatus[] fileStatus = fs.listStatus(filePath);
          for (FileStatus fileStat : fileStatus) {
            fileQueue.add(fileStat.getPath());
          }
        }
      }
      return filePathList;
    }
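
    A brief example of how you might obtain the FileSystem and call it (the path is a placeholder):

    Configuration conf = new Configuration();
    Path root = new Path("hdfs://namenode:9000/user/test/in");
    List<String> paths = listAllFilePath(root, root.getFileSystem(conf));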
    
  • 2020-12-01 06:16

    You'll need to use the FileSystem object and perform some logic on the resultant FileStatus objects to manually recurse into the subdirectories.

    You can also apply a PathFilter to return only the xml files, using the listStatus(Path, PathFilter) method.
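
    For example, a minimal sketch, assuming fs is an open FileSystem handle and the directory path is just a placeholder:

    FileStatus[] xmlFiles = fs.listStatus(new Path("/user/test/in"),
            new PathFilter() {
                public boolean accept(Path path) {
                    //keep only entries whose name ends in .xml
                    return path.getName().endsWith(".xml");
                }
            });

    Note that the filter is applied to directories as well, so this suits a single-level listing; if you recurse manually, recurse first and filter the files afterwards.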

    The Hadoop FsShell class has an example of this for the hadoop fs -lsr command, which is a recursive ls; see the source, around line 590 (the recursive step is triggered on line 635).

  • 2020-12-01 06:18

    If you are using the Hadoop 2.* API, there is a more elegant solution:

        //getConf() assumes this code lives in a class extending Configured (e.g. a Tool)
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        FileSystem fs = FileSystem.get(conf);
    
        //the second boolean parameter here sets the recursion to true
        RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
                new Path("path/to/lib"), true);
        while(fileStatusListIterator.hasNext()){
            LocatedFileStatus fileStatus = fileStatusListIterator.next();
            //do stuff with the file like ...
            job.addFileToClassPath(fileStatus.getPath());
        }
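
    Since the question asks for xml files specifically, you could filter by name inside the loop; a small sketch (the root path is a placeholder, and collecting paths replaces the classpath call above):

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("path/to/xml/root"), true);
        List<String> xmlPaths = new ArrayList<String>();
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            //keep only paths whose file name ends in .xml
            if (status.getPath().getName().endsWith(".xml")) {
                xmlPaths.add(status.getPath().toString());
            }
        }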
    
  • 2020-12-01 06:18

    Have you tried this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    public class Cat {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // you need to pass in your hdfs path; note that this lists only the
            // direct children of the directory, it does not recurse
            FileStatus[] status = fs.listStatus(new Path("hdfs://test.com:9000/user/test/in"));
    
            for (FileStatus stat : status) {
                // skip subdirectories; fs.open() only works on files
                if (!stat.isFile()) {
                    continue;
                }
                // print the contents of each file to stdout
                try (BufferedReader br =
                         new BufferedReader(new InputStreamReader(fs.open(stat.getPath())))) {
                    String line = br.readLine();
                    while (line != null) {
                        System.out.println(line);
                        line = br.readLine();
                    }
                }
            }
        }
    }
    