Pass directories not files to hadoop-streaming?


In my job, I have the need to parse many historical logsets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example

2 Answers

礼貌的吻别

2021-02-16 00:22

I guess you need to investigate writing a custom InputFormat that you can pass the root directory to. It would create a split for each customer, and the record reader for each split would then do the directory walk and push the file contents to your mappers.
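The split-per-customer idea can be sketched outside Hadoop in plain Python. This is only an illustration of the logic, not the Hadoop InputFormat API: the function names `customer_splits` and `read_records` are hypothetical, and it assumes each immediate subdirectory of the root is one customer.

```python
import os
from typing import Dict, Iterator, Tuple


def customer_splits(root: str) -> Dict[str, str]:
    """Map each customer name to its directory: one 'split' per customer.

    Assumption: every immediate subdirectory of root is a customer.
    """
    return {
        entry: os.path.join(root, entry)
        for entry in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, entry))
    }


def read_records(customer_dir: str) -> Iterator[Tuple[str, str]]:
    """Walk one customer's tree and yield (file path, line) records."""
    for dirpath, dirnames, filenames in os.walk(customer_dir):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "r", errors="replace") as handle:
                for line in handle:
                    yield path, line.rstrip("\n")
```

In a real job the equivalent of `customer_splits` would live in the InputFormat and the equivalent of `read_records` in its record reader, so each mapper receives one customer's logs.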

暗喜

2021-02-16 00:36
Hadoop allows an input path to be a glob pattern (wildcards rather than full regular expressions). I haven't experimented with the more complex patterns, but the simple wildcards ? and * do work.

So in your case, I think the following input path should work:
```
file:///mnt/logs/Customer_Name/*/*
```
The last asterisk might not be needed, since all the files in the final directory are automatically added as input paths.
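To preview what such a pattern would match before submitting a job, one option is to expand it locally. The sketch below uses Python's standard `glob` module; the helper name `expand_input_pattern` is hypothetical, and Python's wildcard rules are similar to, but not guaranteed identical to, Hadoop's.

```python
import glob
from typing import List


def expand_input_pattern(pattern: str) -> List[str]:
    """Return the sorted local paths that a wildcard pattern matches."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # With the trailing /* the pattern lists files inside each date directory;
    # without it, the pattern lists the date directories themselves.
    for path in expand_input_pattern("/mnt/logs/Customer_Name/*/*"):
        print(path)
```

Comparing the output with and without the trailing `/*` shows whether the pattern resolves to files or to their parent directories.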