Hadoop: read multiple lines at a time

你的背包 2021-02-04 17:54

I have a file in which every set of four lines represents a record.

e.g., the first four lines represent record 1, the next four represent record 2, and so on.

How can I read four lines at a time, so that each record is processed as a single unit?

2 Answers
  •  死守一世寂寞
    2021-02-04 18:53

    Another way (easy, but possibly inefficient in some cases) is to override FileInputFormat#isSplitable() so that it returns false. Then the input files are not split, and each file is processed by a single map task.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // A TextInputFormat (old mapred API) whose input files are never split,
    // so each file is processed in its entirety by a single mapper.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }
    }
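
    With the old JobConf API, this custom format is then registered on the job. The driver below is only an assumed illustration of that step (the class name FourLineDriver and the command-line path arguments are hypothetical, not from the original answer):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver: plugs the non-splittable format into the job so
    // each input file is handed, unsplit, to a single mapper.
    public class FourLineDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(FourLineDriver.class);
            conf.setJobName("four-line-records");

            // Each file becomes exactly one (whole-file) input split.
            conf.setInputFormat(NonSplittableTextInputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }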
    

    And as orangeoctopus said:

    In your map function, add each new key/value pair to a list. If the list has 4 items in it, do your processing, emit something, then clear the list. Otherwise, don't emit anything and move on.
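
    A minimal sketch of such a buffering mapper, using the old org.apache.hadoop.mapred API to match the input format above, and assuming Java 8 for String.join. The class name FourLineMapper and the Text/NullWritable output types are illustrative assumptions, not from the original answer:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: buffers lines until it has a full four-line record,
    // then emits the joined record and clears the buffer.
    public class FourLineMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        private final List<String> buffer = new ArrayList<String>();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            buffer.add(value.toString());
            if (buffer.size() == 4) {
                // All four lines of the record are present: process and emit.
                output.collect(new Text(String.join("\n", buffer)), NullWritable.get());
                buffer.clear();
            }
            // Otherwise, wait for the remaining lines of the current record.
        }
    }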

    This approach has some overhead, for the following reasons:

    • The time to process the largest file drags out the job completion time.
    • A lot of data may be transferred between the data nodes.
    • The cluster is not properly utilized, since # of maps = # of files.

    The above code is from Hadoop: The Definitive Guide.
