I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allow one to analyze binary data in a language other than Java. (Without one of these modes, Streaming interprets certain bytes, notably '\t' and '\n', as delimiters, which corrupts binary data.)
We solved the binary-data issue by hex-encoding the data at the split level when streaming it down to the mapper. This preserves the parallel efficiency of the operation, instead of requiring the data to be transformed before processing on a node.
Okay, I've found a combination that works, but it's weird.
Prepare a valid typedbytes file in your local filesystem, following the documentation or imitating typedbytes.py.
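For concreteness, here is a minimal sketch of writing such a file in Python, assuming bytes-typed keys and values (type code 0: a one-byte type code, a 4-byte big-endian length, then the payload); the file name and record contents are only examples:

    import struct

    def write_bytes_record(f, payload):
        # type code 0 ("bytes"), 4-byte big-endian length, then the raw bytes
        f.write(struct.pack('>bi', 0, len(payload)))
        f.write(payload)

    with open('local/typedbytes.tb', 'wb') as f:
        for key, value in [(b'key1', b'\x00\x01 raw \n\t bytes'),
                           (b'key2', b'another value')]:
            write_bytes_record(f, key)    # key record
            write_bytes_record(f, value)  # value record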
Use
hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb
to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.
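If you are already generating the typedbytes from Python, you can drive this step from the same script; a sketch, with the paths as placeholders:

    import subprocess

    with open('local/typedbytes.tb', 'rb') as tb:
        subprocess.run(['hadoop', 'jar', 'path/to/streaming.jar', 'loadtb',
                        'path/on/HDFS.sequencefile'],
                       stdin=tb, check=True)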
Use
hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...
to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes or -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries, not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this:

    32-bit signed integer, as length (note: no type character)
    block of data with that length: this is the key
    '\t' (tab character)
    32-bit signed integer, as length
    block of data with that length: this is the value
    '\n' (newline character)
If you want to additionally send raw data from mapper to reducer, add
-D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
to your Hadoop command line, and format the mapper's output and the reducer's expected input as valid typedbytes. These records also alternate as key-value pairs, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by keys.
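Continuing the sketch with bytes-typed (type code 0) records, the mapper's output side could look like this; emit_typedbytes is an illustrative helper, not a Hadoop API:

    import sys
    import struct

    def emit_typedbytes(out, key, value):
        # each record: type code 0 ("bytes"), 4-byte big-endian length, payload;
        # no '\t' between key and value, no '\n' between pairs
        for payload in (key, value):
            out.write(struct.pack('>bi', 0, len(payload)))
            out.write(payload)

    emit_typedbytes(sys.stdout.buffer, b'some-key', b'arbitrary\x00binary\nbytes')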
The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)
This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at the binary record boundaries, rather than at '\n', which is the really important thing, and strong typing between the mapper and the reducer.
Apparently there is a patch adding a JustBytes I/O mode for streaming, which feeds a whole input file to the mapper command:
https://issues.apache.org/jira/browse/MAPREDUCE-5018