I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format so that it could be run by anyone with access to a Hadoop cluster.
After much googling (etc.) I figured out how to include executable binaries/scripts/modules so that they're accessible to your mappers/reducers. The trick is to upload all your files to HDFS first:
$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py
Then you need to format your streaming command like the following template:
$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-file /local/file/system/data/data.txt \
-file /local/file/system/mapper.py \
-file /local/file/system/reducer.py \
-cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
-input data.txt \
-output output/ \
-mapper mapper.py \
-reducer reducer.py \
-verbose
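For context on what -mapper and -reducer expect: Hadoop Streaming feeds input lines to the mapper on stdin and reads tab-separated key/value pairs from its stdout; the reducer then receives those pairs sorted by key. Here is a minimal, hypothetical mapper.py/reducer.py pair (a word-count toy, not my actual pipeline) just to show the shape the scripts need to have:

#!/usr/bin/env python
# mapper.py -- hypothetical word-count mapper: emit "word<TAB>1" for each word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("%s\t1\n" % word)

#!/usr/bin/env python
# reducer.py -- hypothetical word-count reducer: sum counts for each key
# (streaming delivers the mapper output sorted by key, so equal keys arrive adjacent)
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            sys.stdout.write("%s\t%d\n" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, total))

Make sure the scripts are executable and start with a shebang (or pass something like -mapper "python mapper.py"), and note that the job will refuse to run if the output/ directory already exists.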
If you're linking a Python module, you'll need to add the following code to your mapper/reducer scripts (the -cacheFile option symlinks module.py into the task's working directory, so '.' is enough):
import sys
sys.path.append('.')  # the -cacheFile symlink lands in the task's working directory
import module
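To make that concrete, here is a sketch of a mapper that uses the cached module; module.transform() is a made-up function standing in for whatever your module actually exports:

#!/usr/bin/env python
# mapper.py -- sketch only; module.transform() is a hypothetical function
import sys
sys.path.append('.')   # '.' is the task's working directory, where module.py was symlinked
import module

for line in sys.stdin:
    result = module.transform(line.strip())   # whatever processing your module provides
    sys.stdout.write("%s\t1\n" % result)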
If you're accessing a binary via the subprocess module, your command should look something like this:
import shlex
from subprocess import Popen, PIPE

cli = "./binary %s" % (argument)
cli_parts = shlex.split(cli)          # split the command string into argv form
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
output = mp.communicate()[0]          # capture the binary's stdout
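Putting that into a streaming mapper, one simple approach is to hand the binary all of stdin in one shot and emit whatever it writes back. The sketch below assumes a hypothetical ./binary that reads stdin and writes one output line per input line:

#!/usr/bin/env python
# mapper.py -- sketch: pipe all of stdin through a bundled binary (hypothetical ./binary)
import sys
import shlex
from subprocess import Popen, PIPE

cli_parts = shlex.split("./binary")
mp = Popen(cli_parts, stdin=PIPE, stdout=PIPE, stderr=PIPE)
out, err = mp.communicate(sys.stdin.read())   # communicate() avoids pipe deadlocks

for line in out.splitlines():
    sys.stdout.write("%s\t1\n" % line)

Using communicate() rather than writing and reading the pipes by hand avoids the buffer deadlock you can hit when the binary produces a lot of output.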
Hope this helps.
Got it running, finally:
use IPC::Open2;
my $pid = open2(my $out, my $in, "./binary") or die "could not run open2";
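For anyone staying in Python, the rough equivalent of that Perl open2 call -- a bidirectional pipe where you write a record to the binary and read its reply -- might look like the sketch below. This is not the author's script; it assumes a hypothetical ./binary that answers each input line with exactly one (flushed) output line, otherwise this pattern can deadlock:

#!/usr/bin/env python
# Sketch of an open2-style bidirectional pipe via subprocess; ./binary is hypothetical
# and is assumed to answer each input line with exactly one flushed output line.
import sys
from subprocess import Popen, PIPE

mp = Popen(["./binary"], stdin=PIPE, stdout=PIPE)

for line in sys.stdin:
    mp.stdin.write(line)          # send one record to the binary
    mp.stdin.flush()
    reply = mp.stdout.readline()  # read its single-line reply
    sys.stdout.write("%s\t1\n" % reply.strip())

mp.stdin.close()
mp.wait()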