Hadoop MapReduce: providing nested directories as job input

忘了有多久 2021-02-04 01:08

I'm working on a job that processes a nested directory structure, containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt

5 Answers
  •  不知归路
    2021-02-04 01:56

    I find recursively going through data can be dangerous since there may be lingering log files from a distcp or something similar. Let me propose an alternative:

    Do the recursive walk on the command line, then pass the paths as a single space-delimited parameter to your MapReduce program, and grab the list from argv:

    $ hadoop jar blah.jar "`hadoop fs -lsr recursivepath | awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '`"
    

    Sorry for the long bash, but it gets the job done. You could wrap the thing in a bash script to break things out into variables.
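    A minimal sketch of such a wrapper script, with the filtering stage pulled out into a named function (the `/data.*\.txt` pattern and `blah.jar` are carried over from the one-liner above; the `/one` root path is hypothetical, so adjust both to your layout):

    ```shell
    #!/bin/sh
    # filter_paths: read `hadoop fs -lsr` output on stdin and emit the
    # matching .txt data files as one space-separated list.
    # (-lsr prints the file path in column 8.)
    filter_paths() {
        awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '
    }

    # Against a live cluster (hypothetical root path and jar):
    #   PATHS=$(hadoop fs -lsr /one | filter_paths)
    #   hadoop jar blah.jar "$PATHS"
    ```

    Keeping the filter in one function means you can tighten the grep pattern later without touching the job-submission line.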

    I personally like the pass-in-filepath approach to writing my MapReduce jobs: the code itself has no hardcoded paths, and it's relatively easy to set it up to run against a more complex set of files.
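    For completeness: if the stray-log-file concern doesn't apply, newer Hadoop releases (2.x and later) can be told to recurse into input directories themselves via a job property, so the job can just be handed the top-level directory. A sketch, assuming the property name from Hadoop 2.x and hypothetical input/output paths:

    ```shell
    # Assumes the driver uses ToolRunner/GenericOptionsParser so -D is honored;
    # /one and /output are placeholder paths.
    hadoop jar blah.jar \
        -D mapreduce.input.fileinputformat.input.dir.recursive=true \
        /one /output
    ```

    Check the exact property name against your Hadoop version before relying on it.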
