Pass directories not files to hadoop-streaming?

礼貌的吻别 2021-02-15 23:46

At work I need to parse many historical log sets. Each customer (there are thousands) may have hundreds of log subdirectories broken out by date. For example

2 answers
  •  暗喜 (original poster)
    2021-02-16 00:36

    Hadoop accepts glob patterns (not full regular expressions) in input paths. I haven't experimented with complex patterns, but the simple wildcards ? and * do work.

    So in your case, I think the following input path will work:

    file:///mnt/logs/Customer_Name/*/*
    

    The last asterisk might not be needed, since all the files in the final directory are automatically added as input paths.
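
    To make the wildcard idea concrete, here is a minimal Python sketch, not a definitive recipe. It previews what such a pattern expands to on the local filesystem and assembles a streaming command that passes the pattern through as a single -input argument. The helper names, the jar name hadoop-streaming.jar, the output directory, and the mapper.py / reducer.py scripts are all hypothetical placeholders. Python's glob module is only an approximation of Hadoop's own glob handling, so corner cases may differ.

```python
import glob
import os
import tempfile


def preview_matches(pattern):
    """Return the sorted local paths that a glob pattern expands to.

    This uses Python's glob module as a rough stand-in for Hadoop's
    glob expansion; it is a preview aid, not the same implementation.
    """
    return sorted(glob.glob(pattern))


def build_streaming_command(input_glob, output_dir, mapper, reducer,
                            streaming_jar="hadoop-streaming.jar"):
    """Build an argument list for a Hadoop streaming job.

    streaming_jar is a placeholder: the real jar path depends on the
    installation. The pattern stays one argument, so when the list is
    run without a shell, Hadoop (not the shell) expands the wildcards.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-input", input_glob,
        "-output", output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    # Build a throwaway tree shaped like Customer_Name/<date>/<logfile>.
    with tempfile.TemporaryDirectory() as root:
        for day in ("2021-01-01", "2021-01-02"):
            day_dir = os.path.join(root, "Customer_Name", day)
            os.makedirs(day_dir)
            open(os.path.join(day_dir, "app.log"), "w").close()

        # One wildcard per directory level: date directories, then files.
        for path in preview_matches(os.path.join(root, "Customer_Name", "*", "*")):
            print(path)

    # Hypothetical job: the output directory and script names are made up.
    print(build_streaming_command("file:///mnt/logs/Customer_Name/*/*",
                                  "out", "mapper.py", "reducer.py"))
```

    If you hand the resulting list to subprocess.run without shell=True, the asterisks reach Hadoop untouched. When typing the command in a shell instead, quoting the input path is the usual way to get the same effect.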
