Hadoop: read multiple lines at a time

你的背包 2021-02-04 17:54

I have a file in which every set of four lines represents a record.

e.g., the first four lines represent record 1, the next four represent record 2, and so on.

How can I read four lines at a time, so that each record is processed as a single unit?

2 Answers
  •  死守一世寂寞
    2021-02-04 18:53

    Another way (easy, but possibly inefficient in some cases) is to override FileInputFormat#isSplitable() so that it returns false. Then the input files are not split, and each file is processed by a single map task.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // A TextInputFormat (old mapred API) whose input files are never split,
    // so each file is processed in its entirety by a single mapper.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }
    }
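
    With the old JobConf API, this custom format is then registered on the job. The driver below is only an assumed illustration of that step (the class name FourLineDriver and the command-line path arguments are hypothetical, not from the original answer):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver: plugs the non-splittable format into the job so
    // each input file is handed, unsplit, to a single mapper.
    public class FourLineDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(FourLineDriver.class);
            conf.setJobName("four-line-records");

            // Each file becomes exactly one (whole-file) input split.
            conf.setInputFormat(NonSplittableTextInputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }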
    

    And as orangeoctopus said:

    In your map function, add each new key/value pair to a list. If the list has 4 items in it, do your processing, emit something, then clear the list. Otherwise, don't emit anything and move on.
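
    A minimal sketch of such a buffering mapper, using the old org.apache.hadoop.mapred API to match the input format above, and assuming Java 8 for String.join. The class name FourLineMapper and the Text/NullWritable output types are illustrative assumptions, not from the original answer:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: buffers lines until it has a full four-line record,
    // then emits the joined record and clears the buffer.
    public class FourLineMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        private final List<String> buffer = new ArrayList<String>();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            buffer.add(value.toString());
            if (buffer.size() == 4) {
                // All four lines of the record are present: process and emit.
                output.collect(new Text(String.join("\n", buffer)), NullWritable.get());
                buffer.clear();
            }
            // Otherwise, wait for the remaining lines of the current record.
        }
    }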

    This approach has some overhead, for the following reasons:

    • The time to process the largest file drags out the job completion time.
    • A lot of data may be transferred between the data nodes.
    • The cluster is not properly utilized, since # of maps = # of files.

    The above code is from Hadoop: The Definitive Guide.
