How to read text source in hadoop separated by special character

Submitted by 谁说胖子不能爱 on 2019-12-11 09:49:15

Question


My data format uses \0 instead of a newline, so the default Hadoop text line reader doesn't work. How can I configure it to read lines separated by a special character?

If it is impossible to configure the LineReader, maybe it is possible to apply a stream preprocessor (tr "\0" "\n") instead, but I'm not sure how to do this.


Answer 1:


You can write your own InputFormat class that splits data on \0 instead of \n. For a walkthrough on how to do that, check here: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat

The gist of it is that you need to subclass the default InputFormat class, or any of its subclasses, and define your own RecordReader with custom rules. For more on that, you can refer to the InputFormat documentation.
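As a rough illustration (not from the original answer), here is a minimal sketch of that approach for the new mapreduce API. It assumes a Hadoop version whose LineRecordReader accepts a custom delimiter byte sequence (Hadoop 2.x does); the class name NullByteInputFormat is made up for this example:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical input format that treats '\0' as the record separator instead of '\n'.
    public class NullByteInputFormat extends TextInputFormat {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            // LineRecordReader(byte[]) lets you supply an arbitrary record delimiter
            // (assumes a Hadoop release that exposes this constructor, e.g. 2.x).
            return new LineRecordReader("\0".getBytes(StandardCharsets.UTF_8));
        }
    }

You would then register it on the job with job.setInputFormatClass(NullByteInputFormat.class).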




Answer 2:


There is a "textinputformat.record.delimiter" configuration property for exactly this purpose. You can change the default end-of-line ("\n") delimiter by setting this property to "\0".
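For example, a minimal driver sketch (not from the original answer; assumes the new mapreduce API, where Job.getInstance is available, and an arbitrary job name):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class NullDelimitedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the stock TextInputFormat to split records on '\0' instead of '\n'.
            conf.set("textinputformat.record.delimiter", "\0");

            Job job = Job.getInstance(conf, "null-delimited-input");
            job.setInputFormatClass(TextInputFormat.class);
            // ... set mapper/reducer, input/output paths, then submit as usual.
        }
    }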

For more information, go here: http://amalgjose.wordpress.com/2013/05/27/custom-text-input-format-record-delimiter-for-hadoop

There is also a similar question about changing the default delimiter in Spark, which may be useful too: Setting textinputformat.record.delimiter in spark




Answer 3:


Would using Cascading's TextDelimited scheme work? http://docs.cascading.org/cascading/1.2/javadoc/cascading/scheme/TextDelimited.html

That avoids having to write your own InputFormat, etc.

Examples of TextDelimited are in https://github.com/Cascading/Impatient/wiki/Part-2
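For orientation, a minimal sketch of declaring TextDelimited taps in the style of the Impatient tutorial (not from the original answer; assumes Cascading 2.x package names, and the class name, field names, and paths are illustrative). Note that TextDelimited's delimiter separates fields within each line, so whether it covers a \0 record separator depends on how the data is laid out:

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CopyDelimited {
        public static void main(String[] args) {
            String inPath = args[0];
            String outPath = args[1];

            Properties properties = new Properties();
            AppProps.setApplicationJarClass(properties, CopyDelimited.class);

            // TextDelimited parses each line into the named fields using the given delimiter.
            Tap inTap = new Hfs(new TextDelimited(new Fields("a", "b"), "\t"), inPath);
            Tap outTap = new Hfs(new TextDelimited(new Fields("a", "b"), "\t"), outPath);

            // A pass-through pipe that simply copies source tuples to the sink.
            Pipe copy = new Pipe("copy");

            FlowDef flowDef = FlowDef.flowDef()
                .addSource(copy, inTap)
                .addTailSink(copy, outTap);

            new HadoopFlowConnector(properties).connect(flowDef).complete();
        }
    }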



Source: https://stackoverflow.com/questions/12118836/how-to-read-text-source-in-hadoop-separated-by-special-character
