I have a file in which a set of every four lines represents a record.
eg, first four lines represent record1, next four represent record 2 and so on..
How can I
Another way (easy but may not be efficient in some cases) is to implement the FileInputFormat#isSplitable(). Then the input files are not split and are processed one per map.
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat {
@Override
protected boolean isSplitable(FileSystem fs, Path file) {
return false;
}
}
And as orangeoctopus said
In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
This has some overhead for the following reasons
** The above code is from Hadoop : The Definitive Guide