I ran into a small problem when trying this on Cloudera 5.4.2, based on this article:
Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitt
Use the Cloudera TwitterSource.
Otherwise you will run into this problem:
Unable to correctly load twitter avro data into hive table
The article uses the Apache TwitterSource:
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource:
https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
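If an existing agent configuration already uses the Apache class, the swap described above can be scripted. This is only a minimal sketch: the function name use_cloudera_source is made up for illustration, and the path of your Flume configuration file is something you have to supply yourself.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the source type in a Flume agent config
# from the Apache TwitterSource to the Cloudera TwitterSource.
# The argument is the path to your own config file (an assumption here).
use_cloudera_source() {
  conf="$1"
  # Write to a temporary file first, then replace the original,
  # so the edit does not depend on a particular sed -i flavour.
  sed 's/org\.apache\.flume\.source\.twitter\.TwitterSource/com.cloudera.flume.source.TwitterSource/' \
    "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
}
```

Call it as use_cloudera_source followed by the path to your agent's configuration file, then restart the agent so the new source class is picked up.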
Also, do not just download the pre-built jar. Our Cloudera version is 5.4.2, and with the pre-built jar you will get this error:
Cannot run Flume because of JAR conflict
Instead, compile it yourself using Maven:
https://github.com/cloudera/cdh-twitter-example
Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.
Steps:
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
sudo yum install apache-maven
mvn package
Put the resulting jar into the Flume plugins directory:
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
Note: run yum update to bring the system to the latest version first; otherwise the compile step (mvn package) fails with a security-related error.
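Put together, the steps might look like the sketch below. Several details are assumptions not stated above: that the archive unpacks to cdh-twitter-example-master, that the Maven project sits in its flume-sources subdirectory, that the built jar ends up under target/, and that unzip is installed. The function name build_and_install is made up for illustration.

```shell
#!/bin/sh
# Target directory and jar name as given in the steps above.
PLUGIN_LIB=/var/lib/flume-ng/plugins.d/twitter-streaming/lib
JAR=flume-sources-1.0-SNAPSHOT.jar

# Hypothetical helper; runs in a subshell so the caller's working
# directory is left unchanged, and stops at the first failing step.
build_and_install() (
  set -e
  sudo yum install -y apache-maven
  wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
  unzip master.zip
  # Assumed layout of the unpacked archive.
  cd cdh-twitter-example-master/flume-sources
  mvn package
  # Assumed Maven output location: target/<jar>.
  sudo mkdir -p "$PLUGIN_LIB"
  sudo cp "target/$JAR" "$PLUGIN_LIB/"
)
```

Running build_and_install from an empty working directory performs the whole sequence; nothing happens until the function is called.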