Hadoop Streaming: Mapper 'wrapping' a binary executable

一个人的身影 2020-12-25 08:58

I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format such that it could be

2 Answers
  • 2020-12-25 09:16

    After much googling (etc.) I figured out how to make executable binaries/scripts/modules accessible to your mappers/reducers. The trick is to upload all your files to HDFS first.

    $ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py
    

    Then you need to format your streaming command like the following template:

    $ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
    -file /local/file/system/data/data.txt \
    -file /local/file/system/mapper.py \
    -file /local/file/system/reducer.py \
    -cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
    -input data.txt \
    -output output/ \
    -mapper mapper.py \
    -reducer reducer.py \
    -verbose
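
For orientation, here is a minimal sketch of the contract that command relies on: Hadoop Streaming feeds input lines to the mapper on stdin, sorts the mapper's tab-separated key/value output by key, and feeds the sorted records to the reducer on stdin. The names `map_line`, `reduce_sorted`, and `run_local` are hypothetical, the word count is only a stand-in for the real pipeline, and the local `sorted` call merely imitates Hadoop's shuffle.

```python
from itertools import groupby


def map_line(line):
    """Hypothetical mapper body: emit one 'word<TAB>1' record per word."""
    return ["%s\t1" % word for word in line.split()]


def reduce_sorted(records):
    """Hypothetical reducer body: sum the counts of each run of equal keys."""
    pairs = (record.split("\t", 1) for record in records)
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.append("%s\t%d" % (key, sum(int(value) for _, value in group)))
    return out


def run_local(lines):
    """Imitate map -> sort by key -> reduce on a single machine."""
    mapped = [record for line in lines for record in map_line(line)]
    return reduce_sorted(sorted(mapped))
```

Running something like `run_local(open("data.txt"))` on a small sample is a cheap way to check the mapper and reducer logic before submitting the job.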
    

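    If you're linking a Python module, you'll need to add the following code to your mapper/reducer scripts. The -cacheFile option symlinks module.py into the task's working directory, so that directory has to be on the import path: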

    import sys 
    sys.path.append('.')
    import module
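
As a rough sketch of how a mapper script could be organised around that import: the helper below adds a directory to `sys.path`, imports the module by name, and applies one of its functions to every input line. `load_cached_module`, `run_mapper`, and the `process` function mentioned afterwards are hypothetical names, not part of Hadoop or of the original scripts.

```python
import importlib
import sys


def load_cached_module(name, search_dir="."):
    """Make search_dir importable, then import the module called `name`."""
    if search_dir not in sys.path:
        sys.path.append(search_dir)
    importlib.invalidate_caches()
    return importlib.import_module(name)


def run_mapper(transform, stdin=sys.stdin, stdout=sys.stdout):
    """Apply transform to each input line and write the result as one output line."""
    for line in stdin:
        stdout.write("%s\n" % transform(line.rstrip("\n")))
```

In mapper.py this might be used as `run_mapper(load_cached_module("module").process)`, assuming module.py defines a `process(line)` function that returns a string.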
    

    If you're calling a binary through the subprocess module, your command should look something like this:

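    import shlex
    from subprocess import Popen, PIPE

    # argument: the command-line argument string to pass to the binary
    cli = "./binary %s" % (argument)
    cli_parts = shlex.split(cli)
    mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
    output = mp.communicate()[0]  # the binary's stdout (bytes in Python 3)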
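
One possible way to turn that call into a complete "wrapping" mapper is sketched below: everything the mapper receives on stdin is sent to the binary, and the binary's stdout becomes the mapper's output. `run_binary` and `map_stream` are hypothetical names, `./binary` stands for whatever executable was shipped with the job, and the sketch assumes the binary reads stdin and writes UTF-8 text to stdout.

```python
import shlex
import sys
from subprocess import PIPE, Popen


def run_binary(command, input_text):
    """Run `command`, send input_text to its stdin, and return its stdout as text."""
    proc = Popen(shlex.split(command), stdin=PIPE, stdout=PIPE, stderr=PIPE)
    stdout, stderr = proc.communicate(input_text.encode("utf-8"))
    if proc.returncode != 0:
        # Fail loudly instead of silently emitting partial output.
        raise RuntimeError("%s exited with %d: %s"
                           % (command, proc.returncode,
                              stderr.decode("utf-8", "replace")))
    return stdout.decode("utf-8")


def map_stream(command, stdin=sys.stdin, stdout=sys.stdout):
    """Pipe everything the mapper receives through the binary and emit its output."""
    stdout.write(run_binary(command, stdin.read()))
```

A mapper.py built this way would end with a call such as `map_stream("./binary")`. Note that `stdin.read()` holds the whole input in memory, which may not suit very large inputs.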
    

    Hope this helps.

  • 2020-12-25 09:38

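    Got it running, finally, by opening the binary with Perl's open2 (from IPC::Open2). Here $out reads from the binary's stdout and $in writes to its stdin:

    use IPC::Open2;
    my $pid = open2(my $out, my $in, "./binary") or die "could not run open2";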
    