Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming


后悔当初

I ran into a small problem while trying Cloudera 5.4.2, following this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitt

1 Answer

广开言路

2021-01-06 14:10
Use the Cloudera TwitterSource; otherwise you will run into this problem:

Unable to correctly load twitter avro data into hive table

The article uses the Apache TwitterSource:
```
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
```
Its documentation describes it as follows:

Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using the streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
```
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
```
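To see which of the two sources an existing agent configuration selects, a small check like the following may help. This is only a sketch: `twitter_source_kind` is a hypothetical helper name, and it assumes the source type is set on a single, uncommented `.type = ...` line as in the snippets above.

```shell
# Hypothetical helper: report which TwitterSource a Flume config file selects.
# Prints "cloudera", "apache", or "other".
twitter_source_kind() {
  # $1 is the path to the Flume agent configuration file.
  if grep -Eq '^[^#]*\.type[[:space:]]*=[[:space:]]*com\.cloudera\.flume\.source\.TwitterSource' "$1"; then
    echo cloudera
  elif grep -Eq '^[^#]*\.type[[:space:]]*=[[:space:]]*org\.apache\.flume\.source\.twitter\.TwitterSource' "$1"; then
    echo apache
  else
    echo other
  fi
}
```

Calling `twitter_source_kind` with the path to your agent's configuration file prints `apache` when the configuration still points at the experimental Avro-producing source.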
Do not just download the prebuilt jar: because our Cloudera version is 5.4.2, you will get this error:

Cannot run Flume because of JAR conflict

Instead, compile it yourself using Maven:

https://github.com/cloudera/cdh-twitter-example

Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

Steps:

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip

sudo yum install apache-maven

mvn package

Put the resulting jar into the Flume plugins directory:
```
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
```
Notice: Run yum update to bring the system to the latest version first; otherwise the compile (mvn package) fails due to some security problem.
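The steps above can be strung together roughly as follows. This is a sketch, not a tested recipe: it assumes the archive unpacks to `cdh-twitter-example-master`, that the source lives in its `flume-sources` subdirectory, and that Maven writes the jar to `target/`. The function names and the `PLUGIN_ROOT` override are illustrative, and the `unzip` package is an added assumption.

```shell
# Assumed values, taken from the plugin path shown in the answer.
PLUGIN_NAME="twitter-streaming"
JAR_NAME="flume-sources-1.0-SNAPSHOT.jar"

# Print the lib directory for this plugin (PLUGIN_ROOT may be overridden).
plugin_lib_dir() {
  echo "${PLUGIN_ROOT:-/var/lib/flume-ng/plugins.d}/${PLUGIN_NAME}/lib"
}

# Nothing runs until this function is called explicitly.
# The body is a subshell so that 'set -e' and 'cd' do not leak to the caller.
build_and_install() (
  set -e
  sudo yum update -y                      # per the notice: update first
  sudo yum install -y apache-maven unzip  # unzip is an assumption
  wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
  unzip master.zip
  # Assumption: sources are in the flume-sources subdirectory of the archive.
  cd cdh-twitter-example-master/flume-sources
  mvn package
  sudo mkdir -p "$(plugin_lib_dir)"
  sudo cp "target/${JAR_NAME}" "$(plugin_lib_dir)/"
)
```

Run `build_and_install` from a scratch directory, then restart the Flume agent so it picks up the jar from the plugins directory.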