#Note# Analyzing Twitter Data with Apache Hadoop, Parts 1, 2, and 3
Andy erpingwu@gmail.com
2013/09/28-2013/09/30
The markdown syntax highlighting renders incorrectly on the oschina blog, but works fine on git.oschina.net: http://git.oschina.net/wuerping/notes/blob/master/2013/2013-09-30/AnalyzingTwitterDatawithApacheHadoop.md
Analyzing Twitter Data with Apache Hadoop
- by Jon Natkins September 19, 2012
- http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
This is the first post in the series. It shows how to design an end-to-end data pipeline for analyzing Twitter data, using Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive.
- The accompanying code is on Cloudera's GitHub.
Who is Influential?
- Now we know the question we want to ask: Which Twitter users get the most retweets? Who is influential within our industry?
- In plainer terms: find out which users are the big influencers (the "big V" accounts).
How Do We Answer These Questions?
- However, querying Twitter data in a traditional RDBMS is inconvenient, since the Twitter Streaming API outputs tweets in a JSON format which can be arbitrarily complex.
- A traditional database would work, but the tweets coming out of the Twitter Streaming API are complex JSON, which is awkward to query that way.
- The diagram above shows a high-level view of how some of the CDH (Cloudera's Distribution Including Apache Hadoop) components can be pieced together to build the data pipeline we need to answer the questions we have.
Gathering Data with Apache Flume
- The two endpoints of a data flow are sources and sinks.
- Each individual piece of data (a tweet) is called an event. Sources produce events; events travel through a channel from the source to the sink; the sink is responsible for writing the data to a predefined location.
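In a Flume configuration file, that model is wired up roughly like the minimal sketch below (the agent and component names here are placeholders of mine, not from the post):

```properties
# one agent, with one source, one channel, one sink
agent.sources = src
agent.channels = ch
agent.sinks = snk

# the source puts events into the channel...
agent.sources.src.channels = ch
# ...and the sink drains events from the same channel
agent.sinks.snk.channel = ch
```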
Sources supported by Flume:
- http://flume.apache.org/FlumeUserGuide.html#flume-sources
- Flume Sources
- Avro Source
- Thrift Source
- NetCat Source
- Syslog Sources
- HTTP Source
- Scribe Source
- ...
Partition Management with Oozie
- Apache Oozie is a workflow coordination system that can be used to solve this problem.
- Oozie is an extremely flexible system for designing job workflows, which can be scheduled to run based on a set of criteria.
- We can configure the workflow to run an ALTER TABLE command that adds a partition containing the last hour's worth of data into Hive, and we can instruct the workflow to occur every hour.
- In short, Apache Oozie is used here to add a new partition every hour.
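The hourly Hive step boils down to something like this sketch (the datehour value and path are illustrative, matching the partition layout shown later in the series):

```sql
-- add the partition covering 2012-09-20, hour 05
ALTER TABLE tweets ADD IF NOT EXISTS
  PARTITION (datehour = 2012092005)
  LOCATION '/user/flume/tweets/2012/09/20/05';
```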
Querying Complex Data with Hive
- Hive expects that input files use a delimited row format, but our Twitter data is in a JSON format, which will not work with the defaults.
- The schema is only really enforced when we read the data, and we can use the Hive SerDe interface to specify how to interpret what we've loaded.
- So: Hive defaults to a delimited row format; to handle JSON, use a Hive SerDe. The sample JSON is too long to reproduce here; see the original post.
- A sample query:
SELECT created_at, entities, text, user
FROM tweets
WHERE user.screen_name='ParvezJugon'
AND retweeted_status.user.screen_name='ScottOstby';
Some Results
- A more complex query:
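Reconstructed from the original post, the "who gets retweeted most" query looks roughly like this; it relies on a retweet_count column that falls inside the "..." elisions of the table definition quoted in Part 3:

```sql
-- top 10 most-retweeted users
SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count
FROM (
  SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
         retweeted_status.text,
         max(retweet_count) AS retweets
  FROM tweets
  GROUP BY retweeted_status.user.screen_name, retweeted_status.text
) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
```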
Conclusion
Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume
- by Jon Natkins October 21, 2012
- http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
**This is the second post in the series. Part 1 showed how to tie the CDH components together into one application; this part goes deeper into each component.**
Sources
Sources come in two different flavors:
- event-driven
- pollable
The difference between the two is essentially push versus pull.
- Event-driven sources typically receive events through mechanisms like callbacks or RPC calls. Pollable sources, in contrast, operate by polling for events every so often in a loop.
Examining the TwitterSource
Configuring the Flume Agent
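The agent configuration walked through in the post looks roughly like this trimmed sketch (the OAuth values are placeholders you must fill in yourself, and the keywords are illustrative):

```properties
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# the custom event-driven source from the cdh-twitter-example repo
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
```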
Channels
This example uses the Memory Channel:
TwitterAgent.channels.MemChannel.type = memory
- http://flume.apache.org/FlumeUserGuide.html#flume-channels
- Flume Channels
- Memory Channel
- JDBC Channel
- File Channel
- Pseudo Transaction Channel
- Custom Channel
Sinks
A nice configuration feature: the HDFS path can include time-based escape sequences:
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
The timestamp information comes from a header that TwitterSource adds to each event:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
Starting the Agent
/etc/default/flume-ng-agent contains the environment variable FLUME_AGENT_NAME (set to TwitterAgent for this example).
$ /etc/init.d/flume-ng-agent start
/user/flume/tweets
natty@hadoop1:~/source/cdh-twitter-example$ hadoop fs -ls /user/flume/tweets/2012/09/20/05
Found 2 items
-rw-r--r-- 3 flume hadoop 255070 2012-09-20 05:30 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893253
-rw-r--r-- 3 flume hadoop 538616 2012-09-20 05:39 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893254.tmp
Data is first written to a .tmp file; when the event-count or time threshold is reached, the file is rolled (renamed). The relevant parameters are rollCount and rollInterval.
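For example (the values here are illustrative, not the post's):

```properties
# roll the file after 10,000 events or 600 seconds, whichever comes first
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
```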
Conclusion
Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive
- by Jon Natkins November 13, 2012
- http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
This is the third post in the series. It weighs Hive's strengths and weaknesses, and argues that Hive is the right choice for this tweet-analysis application.
Characterizing Data
- well-structured
- unstructured
- semi-structured
- poly-structured
Complex Data Structures
SELECT array_column[0] FROM foo;
SELECT map_column['map_key'] FROM foo;
SELECT struct_column.struct_field FROM foo;
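For context, here is a hypothetical table foo against which all three queries would be valid (my sketch, not from the post):

```sql
CREATE TABLE foo (
  array_column  ARRAY<STRING>,             -- indexed with [0], [1], ...
  map_column    MAP<STRING, INT>,          -- indexed with ['map_key']
  struct_column STRUCT<struct_field: INT>  -- accessed with dot notation
);
```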
A Table for Tweets
The table design:
CREATE EXTERNAL TABLE tweets (
...
retweeted_status STRUCT<
  text:STRING,
  user:STRUCT<screen_name:STRING, name:STRING>>,
entities STRUCT<
  urls:ARRAY<STRUCT<expanded_url:STRING>>,
  user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
  hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
SELECT entities.user_mentions[0].screen_name FROM tweets;
This is how JSON objects are mapped onto Hive columns.
Serializers and Deserializers
In Hive, SerDe is an abbreviation of Serializer and Deserializer.
Putting It All Together
One Thing to Watch Out For…
If it looks like a duck and quacks like a duck, it must be a duck, right? New Hive users should be careful not to mistake Hive for a relational database.
Conclusion
Source: oschina
Link: https://my.oschina.net/u/1180405/blog/166829