Programmatically reading the output of a Hadoop MapReduce program

忘掉有多难 2021-02-09 02:26

This may be a basic question, but I could not find an answer for it on Google.
I have a map-reduce job that creates multiple output files in its output directory. My Java application needs to read those output files programmatically. What is the best way to do that?

3 Answers
  •  滥情空心
    2021-02-09 03:13

    You have a few options: here are two that I sometimes use.

    Method #1: Depending on your data size, you can make use of the following HDFS commands (found here, Item 6):

    hadoop fs -getmerge hdfs-output-dir local-file
    # example
    hadoop fs -getmerge /user/kenny/mrjob/ /tmp/mrjob_output
    # another way
    hadoop fs -cat /user/kenny/mrjob/part-r-* > /tmp/mrjob_output


    "This concatenates the HDFS files hdfs-output-dir/part-* into a single local file."

    Then you can just read in that single file (note that it lives on the local filesystem, not in HDFS).
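
    If you would rather do the same merge from Java instead of the shell, Hadoop 2.x provides FileUtil.copyMerge, which concatenates every file under a source directory into one destination file (note it was removed in Hadoop 3, so check your version). A minimal sketch, with hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem srcFs = FileSystem.get(conf);
    FileSystem localFs = FileSystem.getLocal(conf);

    // merge everything under /user/kenny/mrjob/ into one local file,
    // without deleting the source directory (paths are hypothetical)
    FileUtil.copyMerge(srcFs, new Path("/user/kenny/mrjob/"),
                       localFs, new Path("/tmp/mrjob_output"),
                       false, conf, null);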

    Method #2: Create a helper method. (I have a class called HDFS which contains the Configuration and FileSystem instances as well as other helper methods.)

    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    public List<Path> matchFiles(String path, final String filter) {
        List<Path> matches = new LinkedList<Path>();
        try {
            // list the directory, keeping only entries whose path contains the filter string
            FileStatus[] statuses = fileSystem.listStatus(new Path(path), new PathFilter() {
                public boolean accept(Path p) {
                    return p.toString().contains(filter);
                }
            });
            for (FileStatus status : statuses) {
                matches.add(status.getPath());
            }
        } catch (IOException e) {
            LOGGER.error(e.getMessage(), e);
        }
        return matches;
    }
    

    You can then call it like this: hdfs.matchFiles("/user/kenny/mrjob/", "part-")
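
    Once you have the part-file paths, reading them is just a matter of opening each one through a FileSystem handle. A rough sketch, assuming text output (TextOutputFormat) and that hdfs is an instance of the helper class above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    for (Path file : hdfs.matchFiles("/user/kenny/mrjob/", "part-")) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
        String line;
        while ((line = reader.readLine()) != null) {
            // with TextOutputFormat each line is one "key<TAB>value" record
            System.out.println(line);
        }
        reader.close();
    }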
