Question
The input file to my Hadoop M/R job is a text file in which records are separated by the tab character '\t' instead of the newline '\n'. By default Hadoop splits on newlines and treats each line of the text file as a record. How can I instruct it to split on the tab character instead?
One way to do it is to use a custom input format class that uses a filter stream to convert all tabs in the original stream to newlines. But this does not look elegant.
Another way would be to use java.util.Scanner with tab as the separator, but I cannot figure out how to use the java.util.Scanner class in the input format classes.
What is the best approach, and what are the alternatives?
Answer 1:
The values '\r' and '\n' are hard-coded in the org.apache.hadoop.util.LineReader class, so you can't use TextInputFormat with tab-separated records. However, it is not difficult to implement your own InputFormat with a special LineReader class. The simplest solution is to copy the TextInputFormat, LineRecordReader, and LineReader classes into your own package and change the LineReader implementation.
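To illustrate the kind of change a copied LineReader would need, here is a minimal sketch of the core loop in plain Java, with no Hadoop dependency. The class name TabRecordReader and the helper splitRecords are hypothetical, made up for this example. It only shows splitting a byte stream on '\t'. A real replacement would also have to track byte positions and handle records that straddle input split boundaries, as LineRecordReader does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a modified LineReader: records end at '\t', not '\n'. */
public class TabRecordReader {
    // The only delimiter recognised; '\r' and '\n' are treated as ordinary data.
    private static final int DELIMITER = '\t';

    private final InputStream in;

    public TabRecordReader(InputStream in) {
        // Wrap the stream in a BufferedInputStream upstream if it is unbuffered,
        // because this sketch reads one byte at a time.
        this.in = in;
    }

    /** Returns the next record without its delimiter, or null at end of stream. */
    public String readRecord() throws IOException {
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        boolean readAnyByte = false;
        int b;
        while ((b = in.read()) != -1) {
            readAnyByte = true;
            if (b == DELIMITER) {
                break; // end of this record; the delimiter itself is dropped
            }
            record.write(b);
        }
        if (!readAnyByte) {
            return null; // nothing left after the previous delimiter
        }
        // Scanning bytes is safe for UTF-8: 0x09 never occurs inside a multi-byte sequence.
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Convenience helper for in-memory text (an assumption for demo purposes). */
    public static List<String> splitRecords(String text) {
        TabRecordReader reader = new TabRecordReader(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = reader.readRecord()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        for (String record : splitRecords("alpha\tbeta\tgamma")) {
            System.out.println(record);
        }
    }
}
```

As with a line reader, a trailing delimiter does not produce an extra empty record, while two adjacent tabs yield an empty record between them. Before copying classes, it may also be worth checking your Hadoop version: later releases added a textinputformat.record.delimiter configuration property that TextInputFormat reads, which, where available, avoids the custom classes entirely.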
Source: https://stackoverflow.com/questions/7271641/how-to-specify-tab-as-a-record-separator-for-hadoop-input-text-file