Hadoop converting \r\n to \n and breaking ARC format

夕颜 2021-01-19 09:57

I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper which uses a streaming ARCfile

1 answer
  • 2021-01-19 10:11

    Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):

    • PipeMapper.java (0.20.2)

    Around line 106, the input from TextInputFormat is passed to this mapper. By that stage the \r\n has already been stripped from each line, and PipeMapper writes the line out to the streaming process (the mapper script's stdin) terminated with just a \n.

    A suggestion would be to check whether this 'feature' still exists in the Hadoop version you are running and, if it does, amend the source of your PipeMapper.java as required (maybe allow the line terminator to be set via a configuration property).
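
To make the effect concrete, here is a minimal Python sketch that imitates the behaviour described above. It is not Hadoop's code: `strip_terminator`, `pipe_records`, and the `terminator` parameter are hypothetical names used only for illustration, and the sample bytes are made up. It shows why data that depends on exact byte counts, as ARC record lengths do, no longer lines up after the round trip.

```python
import io


def strip_terminator(line: bytes) -> bytes:
    # Imitates a line reader that drops the terminator, whether CRLF or LF.
    if line.endswith(b"\r\n"):
        return line[:-2]
    if line.endswith(b"\n"):
        return line[:-1]
    return line


def pipe_records(data: bytes, terminator: bytes = b"\n") -> bytes:
    # Imitates the pipe step: each stripped line is written back out
    # followed by a single fixed terminator (LF by default).
    # `terminator` stands in for the hypothetical configuration property.
    out = io.BytesIO()
    for raw in io.BytesIO(data):
        out.write(strip_terminator(raw))
        out.write(terminator)
    return out.getvalue()


original = b"header line\r\npayload\r\n"
piped = pipe_records(original)                         # CRLF becomes LF
restored = pipe_records(original, terminator=b"\r\n")  # CRLF kept

print(len(original), len(piped), len(restored))
```

With the default terminator the output is one byte shorter per line, so any length recorded in a header no longer matches the payload. Writing \r\n back only restores the input when every line originally ended in \r\n; input with mixed line endings would still be altered.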
