Hadoop MapReduce provide nested directories as job input

为君一笑 提交于 2019-12-03 01:23:33

I didn't found any document on this but */* works. So it's -input 'path/*/*'.

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

FileInputFormat.setInputDirRecursive(job, true);

No thanks, just call me LeiFeng!

I find recursively going through data can be dangerous since there may be lingering log files from a distcp or something similar. Let me propose an alternative:

Do the recursive walk on the command line, and then pass in the paths in a space-delimited parameter into your MapReduce program. Grab the list from argv:

$ hadoop jar blah.jar "`hadoop fs -lsr recursivepath | awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '`"

Sorry for the long bash, but it gets the job done. You could wrap the thing in a bash script to break things out into variables.

I personally like the pass-in-filepath approach to writing my mapreduce jobs so the code itself doesn't have hardcoded paths and it's relatively easy for me to set it up to run against more complex list of files.

Don't know if still relevant but at least in hadoop 2.4.0 you can set property mapreduce.input.fileinputformat.input.dir.recursive to true and it will solve your problem.

just use FileInputFormat.addInputPath("with file pattern"); i am writing my first hadoop prog for graph analysis where input is from diff dir in .gz format ... it worked for me !!!

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!