I'm trying to stream Twitter data into HDFS using Flume and this example: https://github.com/cloudera/cdh-twitter-example/
Whatever I try, Flume keeps creating files in HDFS that range in size from 1.5 kB to 15 kB, whereas I would like to see large files (64 MB). Here is the agent configuration:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxx
TwitterAgent.sources.Twitter.keywords = test

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 0

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
EDIT: I looked into the log files and found this happening all the time:
9:11:27.526 AM WARN org.apache.flume.sink.hdfs.BucketWriter Block Under-replication detected. Rotating file.
9:11:37.036 AM ERROR org.apache.flume.sink.hdfs.BucketWriter Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication
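
These log messages suggest the files are being rotated because of under-replication rather than because of rollSize. One setting that may be relevant is the HDFS sink's hdfs.minBlockReplicas property, which sets the minimum number of replicas per block that the sink expects. The fragment below is only a sketch under an assumption: that this is a single-node cluster (as the localhost.localdomain path hints), so blocks can never reach a replication target higher than 1. The value 1 is illustrative, and I have not confirmed that it resolves the problem in this setup.

```properties
# Assumption: single-node HDFS, so a block can have at most one replica.
# hdfs.minBlockReplicas tells the HDFS sink how many replicas per block to expect;
# the value 1 is illustrative and should match what the cluster can actually provide.
TwitterAgent.sinks.HDFS.hdfs.minBlockReplicas = 1
```

If the cluster has more than one DataNode, the under-replication itself is probably the thing to investigate instead of lowering this value.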