Question
Let's say I have a MapReduce job that creates an output file part-00000,
and a second job runs after the first one completes.
How can I use the output file of the first job in the DistributedCache for the second job?
Answer 1:
The steps below might help you:

1. Pass the first job's output directory path to the second job's driver class.

2. Use a PathFilter to list the files whose names start with part-. Refer to the snippet below for your second job's driver class:

   FileSystem fs = FileSystem.get(conf);
   FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
       new PathFilter() {
           @Override
           public boolean accept(Path path) {
               return path.getName().startsWith("part-");
           }
       });

3. Iterate over every part-* file and add it to the distributed cache:

   for (int i = 0; i < fileList.length; i++) {
       DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
   }
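Putting the steps together, a minimal second-job driver might look like the sketch below. The class name, argument layout, and job name are placeholders (not from the original answer), and it assumes the classic DistributedCache API, which is deprecated in Hadoop 2 in favor of Job.addCacheFile:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;

public class SecondJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // args[0] is assumed to be the output directory of the first job
        Path firstJobOutput = new Path(args[0]);

        FileSystem fs = FileSystem.get(conf);
        // List only the part-* files, skipping _SUCCESS and _logs
        FileStatus[] fileList = fs.listStatus(firstJobOutput, new PathFilter() {
            @Override
            public boolean accept(Path path) {
                return path.getName().startsWith("part-");
            }
        });

        // Register every part file in the cache before the Job is created,
        // since Job.getInstance takes a snapshot of the Configuration
        for (FileStatus status : fileList) {
            DistributedCache.addCacheFile(status.getPath().toUri(), conf);
        }

        Job job = Job.getInstance(conf, "second job");
        job.setJarByClass(SecondJobDriver.class);
        // ... set mapper, reducer, and the second job's input/output paths here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In Hadoop 2 and later, the same effect is achieved by calling job.addCacheFile(uri) on the Job object instead of using the deprecated DistributedCache class; the cached files are then retrieved in the mapper or reducer via context.getCacheFiles().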
Source: https://stackoverflow.com/questions/30224370/how-to-use-a-mapreduce-output-in-distributed-cache