Question
I hit a small problem while trying Cloudera 5.4.2. I was following this article:
Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
It fetches tweets using Flume and the Twitter streaming API for data analysis. Everything went fine: creating the Twitter app, creating the directory on HDFS, configuring Flume, starting to fetch data, and creating a schema on top of the tweets.
Then came the problem. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink. When a Hive table backed by Avro loads the data, I get an error message saying "Avro block size is invalid or too large".
What is an Avro block, and what is the limit on its size? Can I change it? What does this message mean? Is the file at fault, or some of the records? If the Twitter source had hit bad data, it should have crashed. And if it converted the tweets to Avro without trouble, then the Avro data should be readable again, right?
I also tried avro-tools-1.7.7.jar:
java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232
{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}
{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more
The same problem. I have searched a lot and found no answers.
Could anyone give me a solution if you have hit this problem too? Or could someone who fully understands Avro or the Twitter streaming source offer a clue?
It is a really interesting problem. Think about it.
Answer 1:
Use the Cloudera TwitterSource.
Otherwise you will run into this problem:
Unable to correctly load twitter avro data into hive table
The article uses the Apache TwitterSource:
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource:
https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
Do not just download the prebuilt jar. Our Cloudera version is 5.4.2, and with the prebuilt jar you will get this error:
Cannot run Flume because of JAR conflict
Instead, compile it yourself using Maven:
https://github.com/cloudera/cdh-twitter-example
Download the source and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.
Steps:
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
sudo yum install apache-maven
mvn package
Put the jar in the Flume plugins directory:
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
Note: run yum update first to bring packages up to the latest version; otherwise the build (mvn package) fails due to a security problem.
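As background to the question of what an Avro block is: per the Avro specification, an object container file is a header followed by data blocks. Each block starts with two longs (the number of objects in the block, then the size in bytes of the serialized objects), followed by the objects themselves and a 16-byte sync marker. Longs are written in a variable-length zigzag encoding, so a reader positioned on bytes that are not really a block header can decode a nonsensical value such as a negative size. The Python sketch below only illustrates that encoding; it is not the Avro library's code. read_avro_long is a hypothetical helper name, and the byte 0x4F is used only because it happens to decode to -40, not because it is known to be the byte that was misread in this file.

```python
def read_avro_long(buf, pos=0):
    """Decode one Avro long (variable-length zigzag) from buf at pos.

    Returns (value, next_pos). Illustrative helper, not part of any library.
    """
    raw = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        raw |= (b & 0x7F) << shift   # low 7 bits carry payload
        if not (b & 0x80):           # high bit clear: last byte of this long
            break
        shift += 7
    # Zigzag decoding maps 0, 1, 2, 3, ... to 0, -1, 1, -2, ...
    value = (raw >> 1) ^ -(raw & 1)
    return value, pos


if __name__ == "__main__":
    # A single stray byte 0x4F read where a long is expected decodes to -40.
    value, _ = read_avro_long(b"\x4f")
    print(value)  # -40
```

A negative number can never be a valid block size, which is why the reader gives up with "Block size invalid or too large" instead of trying to read further.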
Source: https://stackoverflow.com/questions/36053306/cloudera-5-4-2-avro-block-size-is-invalid-or-too-large-when-using-flume-and-twi