Pass directories not files to hadoop-streaming?

礼貌的吻别 2021-02-15 23:46

In my job, I need to parse many historical log sets. Each of our customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example

2 Answers
  • 2021-02-16 00:22

    I guess you need to investigate writing a custom InputFormat that you can pass the root directory to. It would create one split per customer, and the record reader for each split would then walk that customer's directories and push the file contents to your mappers.
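
To make the split-per-customer idea concrete, here is a minimal plain-Python sketch of the logic only. It is not a Hadoop InputFormat and uses no Hadoop API; the function names `plan_splits` and `read_split` are hypothetical, and it assumes a local layout of root/customer/date/files.

```python
import os
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def plan_splits(root: str) -> Dict[str, List[str]]:
    """Group every file under root by its top-level customer directory.

    Each dictionary entry stands in for one split (one per customer).
    Customers with no files get no entry.
    """
    splits: Dict[str, List[str]] = defaultdict(list)
    for customer in sorted(os.listdir(root)):
        customer_dir = os.path.join(root, customer)
        if not os.path.isdir(customer_dir):
            continue
        # Walk the date subdirectories, as the record reader would.
        for dirpath, dirnames, filenames in os.walk(customer_dir):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                splits[customer].append(os.path.join(dirpath, name))
    return dict(splits)


def read_split(customer: str, paths: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (customer, line) records, roughly what a mapper would receive."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield customer, line.rstrip("\n")
```

Calling `plan_splits` on the log root and then iterating `read_split` for each entry mimics one mapper per customer. A real implementation would have to do the equivalent inside Hadoop's own input format and record reader classes.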

  • 2021-02-16 00:36

    Hadoop accepts glob patterns in input paths (these are shell-style wildcards rather than full regular expressions). I haven't experimented with complex patterns, but the simple placeholders ? and * do work.

    So in your case, I think the following input path should work:

    file:///mnt/logs/Customer_Name/*/*
    

    The last asterisk might not be needed, since all the files directly inside a matched directory are automatically added as input paths.
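
Before submitting a job, it can help to preview locally which files such a pattern would pick up. The sketch below uses Python's standard `glob` module as a stand-in. It only approximates Hadoop's own glob handling, it assumes the logs sit on a local filesystem (so the `file://` prefix is dropped), and the helper name `expand_input_pattern` is hypothetical.

```python
import glob
import os
from typing import List


def expand_input_pattern(pattern: str) -> List[str]:
    """Expand a glob pattern and return the files it would contribute.

    A match that is a directory contributes the files directly inside it
    (one level only), mirroring the behaviour described in the answer.
    """
    files: List[str] = []
    for match in sorted(glob.glob(pattern)):
        if os.path.isdir(match):
            # Directory match: take its immediate files, do not recurse.
            for name in sorted(os.listdir(match)):
                child = os.path.join(match, name)
                if os.path.isfile(child):
                    files.append(child)
        elif os.path.isfile(match):
            files.append(match)
    return files
```

Under these assumptions, `expand_input_pattern("/mnt/logs/Customer_Name/*")` and `expand_input_pattern("/mnt/logs/Customer_Name/*/*")` return the same list when each date directory contains only plain files, which is the situation in which the trailing asterisk is redundant.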
