I am trying to parse data from commoncrawl.org using hadoop streaming. I set up a local hadoop to test my code, and have a simple Ruby mapper which uses a streaming ARCfile
Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):
Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has been stripped), and the PipeMapper is writing it out to stdout with just a \n.
A suggestion would be to amend the source for your PipeMapper.java, check this 'feature' still exists, and amend as required (maybe allow it to be set via a configuration property).