Pass directories not files to hadoop-streaming?

礼貌的吻别 2021-02-15 23:46

In my job, I need to parse many historical log sets. Each customer (there are thousands) may have hundreds of log subdirectories broken out by date. For example

2 Answers
  •  礼貌的吻别
    2021-02-16 00:22

    I guess you need to look into writing a custom InputFormat that you can pass the root directory to. It would create a split for each customer, and the record reader for each split would then walk that customer's directories and push the file contents to your mappers.
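
To make the split-per-customer idea concrete, here is a minimal sketch in plain Python on a local filesystem. It does not use the Hadoop API. In a real job this logic would probably live in a Java `InputFormat` subclass, with the split logic in `getSplits` and the directory walk in the `RecordReader` returned by `createRecordReader`. The function names `customer_splits` and `read_records` are hypothetical, and the layout (one subdirectory per customer directly under the root) is an assumption, since the question's example is not shown.

```python
import os
from typing import Iterator, List, Tuple


def customer_splits(root: str) -> List[str]:
    """Return one 'split' per customer.

    Assumption: each immediate subdirectory of root belongs to one customer.
    """
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def read_records(split_dir: str) -> Iterator[Tuple[str, str]]:
    """Walk one customer's directory tree and yield (file path, line) records.

    This plays the role of the record reader: it does the directory walk
    so that the caller only ever handles the root directory.
    """
    for dirpath, dirnames, filenames in os.walk(split_dir):
        dirnames.sort()  # make the traversal order deterministic
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    yield path, line.rstrip("\n")
```

Each element returned by `customer_splits(root)` stands in for one input split, and iterating `read_records(split)` stands in for the records a single mapper would receive.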
