Question:
I am writing Hadoop sequence files using text files as input, and I know how to write a sequence file from a text file.
But I want to limit each output sequence file to a specific size, say 256 MB.
Is there any built-in method to do this?
Answer 1:
AFAIK you'll need to write your own custom output format to limit output file sizes - by default FileOutputFormat creates a single output file per reducer.
Another option is to create your sequence files as normal, then run a second, map-only job with identity mappers, amending the minimum / maximum input split size to ensure that each mapper processes only ~256 MB. This means an input file of 1 GB would be processed by 4 mappers and produce output files of ~256 MB each. You will get smaller files where an input file is, say, 300 MB (a 256 MB mapper and a 44 MB mapper will run).
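The arithmetic behind that claim is just ceiling division of the file size by the split size, with the last split taking the remainder. A minimal sketch (plain Java, no Hadoop dependency; the class and method names here are illustrative, not Hadoop API):

```java
public class SplitMath {
    // Divide a file of fileSize bytes into splits of at most splitSize bytes,
    // the way a fixed split size would carve up an input file.
    static long[] splits(long fileSize, long splitSize) {
        // ceiling division: number of splits needed to cover the file
        int n = (int) ((fileSize + splitSize - 1) / splitSize);
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            // every split is full-sized except possibly the last one
            out[i] = Math.min(splitSize, fileSize - (long) i * splitSize);
        }
        return out;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long split = 256 * MB;
        // 1 GB input -> four 256 MB splits (four mappers)
        for (long s : splits(1024 * MB, split)) System.out.println(s / MB + " MB");
        // 300 MB input -> one 256 MB split and one 44 MB split
        for (long s : splits(300 * MB, split)) System.out.println(s / MB + " MB");
    }
}
```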
The properties you are looking for are:
mapred.min.split.size
mapred.max.split.size
They are both configured as byte sizes, so set them both to 268435456 (256 × 1024 × 1024). On newer Hadoop versions the equivalent names are mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
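Setting both properties to the same value works because FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), so pinning min and max to 268435456 makes the HDFS block size irrelevant. A quick check of that formula (pure Java; only the formula itself is taken from Hadoop's FileInputFormat):

```java
public class PinSplitSize {
    // FileInputFormat's split-size rule: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long target = 268435456L; // 256 MB, the value set for both properties
        // With min == max == 256 MB, the block size no longer matters:
        System.out.println(computeSplitSize(64L << 20, target, target));  // 64 MB blocks
        System.out.println(computeSplitSize(128L << 20, target, target)); // 128 MB blocks
        // both print 268435456
    }
}
```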
Source: https://stackoverflow.com/questions/15610116/how-to-limit-size-of-hadoop-sequence-file