I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allow one to analyze binary data in a language other than Java. (Without one of these modes, Streaming interprets certain bytes, notably '\t' and '\n', as delimiters, which corrupts binary data.)
We solved the binary-data issue by hex-encoding the data at the split level when streaming it down to the mapper. This preserves the parallel efficiency of the operation, instead of requiring the data to be transformed before processing on a node.
Okay, I've found a combination that works, but it's weird.
Prepare a valid typedbytes file in your local filesystem, following the documentation or imitating typedbytes.py.
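For concreteness, here is a minimal sketch of writing such a file in Python, assuming bytes-typed keys and values (type code 0: a one-byte type code, a 4-byte big-endian length, then the payload); the file name and record contents are only examples:

    import struct

    def write_bytes_record(f, payload):
        # type code 0 ("bytes"), 4-byte big-endian length, then the raw bytes
        f.write(struct.pack('>bi', 0, len(payload)))
        f.write(payload)

    with open('local/typedbytes.tb', 'wb') as f:
        for key, value in [(b'key1', b'\x00\x01 raw \n\t bytes'),
                           (b'key2', b'another value')]:
            write_bytes_record(f, key)    # key record
            write_bytes_record(f, value)  # value record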
Use
hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb
to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.
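If you are already generating the typedbytes from Python, you can drive this step from the same script; a sketch, with the paths as placeholders:

    import subprocess

    with open('local/typedbytes.tb', 'rb') as tb:
        subprocess.run(['hadoop', 'jar', 'path/to/streaming.jar', 'loadtb',
                        'path/on/HDFS.sequencefile'],
                       stdin=tb, check=True)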
Use
hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...
to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes or -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries, not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this:

    32-bit signed integer, as length (note: no type character)
    block of data with that length: this is the key
    '\t' (tab character)
    32-bit signed integer, as length
    block of data with that length: this is the value
    '\n' (newline character)
If you want to additionally send raw data from mapper to reducer, add
-D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
to your Hadoop command line, and format the mapper's output and the reducer's expected input as valid typedbytes. These records also alternate as key-value pairs, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by keys.
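Continuing the sketch with bytes-typed (type code 0) records, the mapper's output side could look like this; emit_typedbytes is an illustrative helper, not a Hadoop API:

    import sys
    import struct

    def emit_typedbytes(out, key, value):
        # each record: type code 0 ("bytes"), 4-byte big-endian length, payload;
        # no '\t' between key and value, no '\n' between pairs
        for payload in (key, value):
            out.write(struct.pack('>bi', 0, len(payload)))
            out.write(payload)

    emit_typedbytes(sys.stdout.buffer, b'some-key', b'arbitrary\x00binary\nbytes')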
The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)
This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at the binary record boundaries, rather than at '\n', which is the really important thing, and strong typing between the mapper and the reducer.
Apparently there is a patch adding a JustBytes I/O mode for streaming, which feeds a whole input file to the mapper command:
https://issues.apache.org/jira/browse/MAPREDUCE-5018