可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm trying to stream twitter data into hdfs using flume and this: https://github.com/cloudera/cdh-twitter-example/

Whatever I try here, it keeps creating files in HDFS that range in size from 1.5kB to 15kB where I would like to see large files (64Mb). Here is the agent configuration:

TwitterAgent.sources = Twitter TwitterAgent.channels = MemChannel TwitterAgent.sinks = HDFS  TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource TwitterAgent.sources.Twitter.channels = MemChannel TwitterAgent.sources.Twitter.consumerKey = xxxxxx TwitterAgent.sources.Twitter.consumerSecret = xxxxxx TwitterAgent.sources.Twitter.accessToken = xxxxx TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxx TwitterAgent.sources.Twitter.keywords = test  TwitterAgent.sinks.HDFS.channel = MemChannel TwitterAgent.sinks.HDFS.type = hdfs TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:8020/user/flume/tweets/%Y/%m/%d/%H/ TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000 TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864 TwitterAgent.sinks.HDFS.hdfs.rollCount = 0 TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0 TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 0  TwitterAgent.channels.MemChannel.type = memory TwitterAgent.channels.MemChannel.capacity = 10000 TwitterAgent.channels.MemChannel.transactionCapacity = 1000

EDIT: I looked into the log files and found this happening all the time:

9:11:27.526 AM WARN org.apache.flume.sink.hdfs.BucketWriter Block Under-replication detected. Rotating file. 9:11:37.036 AM ERROR org.apache.flume.sink.hdfs.BucketWriter

Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication

回答1:

It seemed to be a problem with the HDFS replication factor. As I am working on a virtual machine with 1 virtual datanode I had to set the replication factor to 1 in order for it to work as expected.

回答2:

Set dfs.replication on your cluster to an appropriate value. This can be done via editing hdfs-site.xml file (on all machines of cluster). However, this is not enough.

You also need to create hdfs-site.xml file on your flume classpath and put the same dfs.replication value from your cluster in it. Hadoop libraries look at this file while doing operations on the cluster, else they use default values.

dfs.replication2

文章来源: Flume HDFS sink keeps rolling small files

标签

flume

HDFS

small