Hadoop MapReduce intermediate output

前端未结


Is there a way to log the intermediate (map phase) output of a MapReduce job without editing the application? (The application is not mine, but the cluster is, and


2 Answers

挽巷

2020-12-06 07:57

The keep.task.files.pattern parameter can be used to keep the intermediate files. They have to be cleaned up manually once the job has completed. Because this is a map/reduce task property, it has to be set in the configuration file, and the jar file has to be packaged again.
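The value of this property is a regular expression over task names, so a candidate pattern can be sanity-checked offline before it goes into the job configuration. The sketch here assumes the pattern has to match the whole task attempt ID, and it uses Python's re module as a stand-in for Java's regex engine. The helper name tasks_kept, the pattern .*_m_.* and the reduce attempt ID are illustrative only; the two map attempt IDs are taken from the example paths in the other answer.

```python
import re

def tasks_kept(pattern, attempt_ids):
    """Return the attempt IDs that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)
    return [a for a in attempt_ids if compiled.fullmatch(a)]

attempts = [
    "attempt_1525687099554_0008_m_000000_0",  # map attempt
    "attempt_1525687099554_0008_m_000001_0",  # map attempt
    "attempt_1525687099554_0008_r_000000_0",  # hypothetical reduce attempt
]

# Illustrative pattern: keep files only for map attempts.
map_only = tasks_kept(r".*_m_.*", attempts)
print(map_only)
```

If the pattern keeps more attempts than intended, tightening it here is cheaper than discovering it through a full NodeManager disk.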

梦谈多话

2020-12-06 08:18

I don't think the MR framework provides any configuration to save intermediate map output files. Even if such a flag exists, it is not very useful because:

The intermediate output produced by the map tasks can't be easily read or used because:
1) The key/value output is serialized before it is written to the intermediate files.
2) Metadata about the key/value pairs (key length, value length, partition number) is also written to these files, and this metadata is in binary format.
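To make the two points concrete, here is a toy length-prefixed record layout. It is not Hadoop's actual intermediate file format: the fixed 4-byte big-endian lengths, the UTF-8 encoding and the function names are all assumptions chosen for illustration. It only shows why such a file looks like binary noise in a text viewer and why a reader has to know the layout to get the pairs back.

```python
import struct

def encode_records(pairs):
    """Pack (key, value) string pairs as [key_len][value_len][key][value]."""
    out = bytearray()
    for key, value in pairs:
        k = key.encode("utf-8")
        v = value.encode("utf-8")
        out += struct.pack(">ii", len(k), len(v))  # two big-endian 4-byte ints
        out += k + v
    return bytes(out)

def decode_records(blob):
    """Walk the byte string and rebuild the (key, value) pairs."""
    pairs = []
    offset = 0
    while offset < len(blob):
        key_len, value_len = struct.unpack_from(">ii", blob, offset)
        offset += 8
        key = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        value = blob[offset:offset + value_len].decode("utf-8")
        offset += value_len
        pairs.append((key, value))
    return pairs

blob = encode_records([("hadoop", "1"), ("map", "2")])
print(blob)                  # raw bytes: lengths interleaved with the data
print(decode_records(blob))  # pairs recovered only because the layout is known
```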

Example locations of these intermediate files:
a) Intermediate spill file (spill output): /yarn/nm/usercache/root/appcache/application_1525687099554_0008/attempt_1525687099554_0008_m_000000_0_spill_0.out
b) Final intermediate file (merge output): /yarn/nm/usercache/root/appcache/application_1525687099554_0008/output/attempt_1525687099554_0008_m_000001_0/file.out
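For a cluster administrator who wants to see whether such files are present on a node, a small script can walk the application cache directory. This is a minimal sketch that assumes the file naming shown in the two example paths; the function name find_map_outputs is hypothetical, and the real root directory depends on how the NodeManager local directories are configured on the cluster. The demo builds a throwaway directory tree that mimics the example layout rather than touching a real node.

```python
import tempfile
from pathlib import Path

def find_map_outputs(appcache_dir):
    """Return (spill files, merged map output files) found under appcache_dir."""
    root = Path(appcache_dir)
    spills = sorted(root.rglob("*_spill_*.out"))  # per-spill intermediate files
    merged = sorted(root.rglob("file.out"))       # final merged map output
    return spills, merged

# Demo on a temporary tree shaped like the example paths.
with tempfile.TemporaryDirectory() as tmp:
    app = Path(tmp) / "application_1525687099554_0008"
    spill = app / "attempt_1525687099554_0008_m_000000_0_spill_0.out"
    final = app / "output" / "attempt_1525687099554_0008_m_000001_0" / "file.out"
    final.parent.mkdir(parents=True)  # also creates the application directory
    spill.touch()
    final.touch()
    spills, merged = find_map_outputs(tmp)
    print([p.name for p in spills], [p.name for p in merged])
```

These files normally disappear when the job finishes, so on a real node the search is only useful while the job is running or when the files have been deliberately preserved.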
