Programmatically reading the output of Hadoop Mapreduce Program

Asked 2021-02-09 02:26

This may be a basic question, but I could not find an answer for it on Google.
I have a map-reduce job that creates multiple output files in its output directory. My Java application needs to read those output files programmatically.

3 Answers
  • 2021-02-09 03:03
                // Assumes a Hadoop Configuration and FileSystem are already set up;
                // the path below points at one reducer output file (adjust to yours).
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path path = new Path("/output/part-r-00000");
                FSDataInputStream inputStream = fs.open(path);
                BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
                String record;
                while ((record = reader.readLine()) != null) {
                    // TextOutputFormat separates key and value with a single blank
                    int blankPos = record.indexOf(" ");
                    String keyString = record.substring(0, blankPos);
                    String valueString = record.substring(blankPos + 1);
                    System.out.println(keyString + " | " + valueString);
                }
                reader.close();
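The split-on-first-blank parsing above can be exercised without a cluster by feeding sample lines through a BufferedReader backed by a StringReader. A minimal sketch (the class name, helper method, and key/value lines are illustrative, not part of the original answer):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class RecordParseDemo {
    // Splits a "key value" record at the first blank, like the answer above.
    static String[] parse(String record) {
        int blankPos = record.indexOf(' ');
        return new String[] {
            record.substring(0, blankPos),
            record.substring(blankPos + 1)
        };
    }

    public static void main(String[] args) throws IOException {
        String sample = "apple 3\nbanana 7\n"; // stand-in for part-r-00000 content
        BufferedReader reader = new BufferedReader(new StringReader(sample));
        String record;
        while ((record = reader.readLine()) != null) {
            String[] kv = parse(record);
            System.out.println(kv[0] + " | " + kv[1]);
        }
    }
}
```

The same loop works unchanged once the BufferedReader wraps an FSDataInputStream instead of a StringReader.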
    
  • 2021-02-09 03:05

    The method you are looking for is called listStatus(Path). It returns all files inside a Path as a FileStatus array. You can then loop over the array, get each entry's Path, and read it.

        FileStatus[] fss = fs.listStatus(new Path("/"));
        for (FileStatus status : fss) {
            Path path = status.getPath();
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            IntWritable key = new IntWritable();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key.get() + " | " + value.get());
            }
            reader.close();
        }
    

    For Hadoop 2.x you can set up the reader like this:

     SequenceFile.Reader reader =
               new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
    
  • 2021-02-09 03:13

    You have a few options: here are two that I sometimes use.

    Method #1, depending on your data size, is to make use of the following HDFS commands (found here, Item 6):

    hadoop fs -getmerge hdfs-output-dir local-file
    # example
    hadoop fs -getmerge /user/kenny/mrjob/ /tmp/mrjob_output
    # another way
    hadoop fs -cat /user/kenny/mrjob/part-r-* > /tmp/mrjob_output
    

    "This concatenates the HDFS files hdfs-output-dir/part-* into a single local file."

    Then you can just read in that single file (note that it is in local storage, not HDFS).
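Once the part files are merged onto local disk, reading them needs only standard java.io/java.nio, no Hadoop classes. A minimal sketch, assuming a merged file like /tmp/mrjob_output exists; here a temp file stands in for it, and the class and method names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MergedOutputReader {
    // Reads every line of a local merged file (e.g. the result of -getmerge).
    static List<String> readLines(Path localFile) throws IOException {
        return Files.readAllLines(localFile);
    }

    public static void main(String[] args) throws IOException {
        // Temp file standing in for /tmp/mrjob_output produced by -getmerge.
        Path merged = Files.createTempFile("mrjob_output", ".txt");
        Files.writeString(merged, "kenny 42\nmrjob 7\n");
        for (String line : readLines(merged)) {
            System.out.println(line);
        }
        Files.delete(merged);
    }
}
```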

    Method #2: Create a helper method. (I have a class called HDFS which contains the Configuration and FileSystem instances as well as other helper methods.)

    public List<Path> matchFiles(String path, final String filter) {
        List<Path> matches = new LinkedList<Path>();
        try {
            FileStatus[] statuses = fileSystem.listStatus(new Path(path), new PathFilter() {
                public boolean accept(Path path) {
                    return path.toString().contains(filter);
                }
            });
            for (FileStatus status : statuses) {
                matches.add(status.getPath());
            }
        } catch (IOException e) {
            LOGGER.error(e.getMessage(), e);
        }
        return matches;
    }
    

    You can then call it like this: hdfs.matchFiles("/user/kenny/mrjob/", "part-")
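The PathFilter in the helper is just a substring test, so its behavior can be checked without HDFS by applying the same predicate to plain file-name strings. A minimal sketch (class, method, and the sample listing below are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PartFileFilter {
    // Same predicate as the PathFilter above: keep names containing the filter.
    static List<String> matchNames(List<String> names, String filter) {
        return names.stream()
                .filter(n -> n.contains(filter))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Typical reducer output directory listing: part files plus _SUCCESS.
        List<String> listing = List.of("part-r-00000", "part-r-00001", "_SUCCESS");
        System.out.println(matchNames(listing, "part-"));
    }
}
```

Filtering on "part-" is what keeps the _SUCCESS marker file out of the results.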
