Flume HDFS sink keeps rolling small files

Submitted anonymously (unverified) on 2019-12-03 01:13:01

Question:

I'm trying to stream Twitter data into HDFS using Flume and this example: https://github.com/cloudera/cdh-twitter-example/

Whatever I try, it keeps creating files in HDFS that range in size from 1.5 kB to 15 kB, where I would like to see large files (64 MB). Here is the agent configuration:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxx
TwitterAgent.sources.Twitter.keywords = test

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 0

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

EDIT: I looked into the log files and found this happening all the time:

9:11:27.526 AM  WARN   org.apache.flume.sink.hdfs.BucketWriter
Block Under-replication detected. Rotating file.

9:11:37.036 AM  ERROR  org.apache.flume.sink.hdfs.BucketWriter
Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication

Answer 1:

It seemed to be a problem with the HDFS replication factor. As I am working on a virtual machine with a single virtual datanode, I had to set the replication factor to 1 for it to work as expected.
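For a single-datanode setup like that, a minimal hdfs-site.xml sketch might look like the following (dfs.replication is the standard Hadoop property; the value of 1 simply matches the single datanode and should be raised on a real cluster):

<configuration>
  <property>
    <!-- With only one datanode, any higher replication target leaves every
         block under-replicated, which is what keeps triggering file rolls. -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Note that this only affects files written after the change; the replication factor of existing files can be lowered afterwards with hdfs dfs -setrep if needed.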



Answer 2:

Set dfs.replication on your cluster to an appropriate value. This can be done by editing the hdfs-site.xml file (on all machines of the cluster). However, this alone is not enough.

You also need to put an hdfs-site.xml file on your Flume classpath with the same dfs.replication value as your cluster. The Hadoop libraries consult this file when performing operations on the cluster; otherwise they fall back to default values.

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
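If the agent is started with the standard flume-ng script, the directory passed via --conf is typically on the agent's classpath, so placing the hdfs-site.xml file there is usually sufficient.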

