#Note# Analyzing Twitter Data with Apache Hadoop, Parts 1, 2, and 3
Andy erpingwu@gmail.com
2013/09/28-2013/09/30
The markdown syntax highlighting renders incorrectly on the oschina blog, but works fine on git.oschina.net: http://git.oschina.net/wuerping/notes/blob/master/2013/2013-09-30/AnalyzingTwitterDatawithApacheHadoop.md
Analyzing Twitter Data with Apache Hadoop
- by Jon Natkins September 19, 2012
- http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
This is the first post in the series. It shows how to design an end-to-end data pipeline for analyzing Twitter data, using Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive.
- The accompanying code is on Cloudera's GitHub.
Who is Influential?
- Now we know the question we want to ask: Which Twitter users get the most retweets? Who is influential within our industry?
- In plainer terms: find out which users are the big influencers (the "big V" accounts).
How Do We Answer These Questions?
- However, querying Twitter data in a traditional RDBMS is inconvenient, since the Twitter Streaming API outputs tweets in a JSON format which can be arbitrarily complex.
- A traditional database would work, but the tweets coming out of the Twitter Streaming API are complex JSON, which is awkward to query that way.
- The diagram above shows a high-level view of how some of the CDH (Cloudera's Distribution Including Apache Hadoop) components can be pieced together to build the data pipeline we need to answer the questions we have.
Gathering Data with Apache Flume
- The two endpoints of a data flow are sources and sinks.
- Each individual piece of data (a tweet) is called an event. Sources produce events; events travel through a channel from the source to the sink; the sink is responsible for writing the data to a predefined location.
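In a Flume configuration file, that model is wired up roughly like the minimal sketch below (the agent and component names here are placeholders of mine, not from the post):

```properties
# one agent, with one source, one channel, one sink
agent.sources = src
agent.channels = ch
agent.sinks = snk

# the source puts events into the channel...
agent.sources.src.channels = ch
# ...and the sink drains events from the same channel
agent.sinks.snk.channel = ch
```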
Sources supported by Flume:
- http://flume.apache.org/FlumeUserGuide.html#flume-sources
- Flume Sources
- Avro Source
- Thrift Source
- NetCat Source
- Syslog Sources
- HTTP Source
- Scribe Source
- ...
Partition Management with Oozie
- Apache Oozie is a workflow coordination system that can be used to solve this problem.
- Oozie is an extremely flexible system for designing job workflows, which can be scheduled to run based on a set of criteria.
- We can configure the workflow to run an ALTER TABLE command that adds a partition containing the last hour's worth of data into Hive, and we can instruct the workflow to occur every hour.
- In short, Apache Oozie is used here to add a new partition every hour.
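The hourly Hive step boils down to something like this sketch (the datehour value and path are illustrative, matching the partition layout shown later in the series):

```sql
-- add the partition covering 2012-09-20, hour 05
ALTER TABLE tweets ADD IF NOT EXISTS
  PARTITION (datehour = 2012092005)
  LOCATION '/user/flume/tweets/2012/09/20/05';
```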
Querying Complex Data with Hive
- Hive expects that input files use a delimited row format, but our Twitter data is in a JSON format, which will not work with the defaults.
- The schema is only really enforced when we read the data, and we can use the Hive SerDe interface to specify how to interpret what we've loaded.
- So: Hive defaults to a delimited row format; to handle JSON, use a Hive SerDe. The sample JSON is too long to reproduce here; see the original post.
- A sample query:
SELECT created_at, entities, text, user
FROM tweets
WHERE user.screen_name='ParvezJugon'
AND retweeted_status.user.screen_name='ScottOstby';
Some Results
- A more complex query:
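Reconstructed from the original post, the "who gets retweeted most" query looks roughly like this; it relies on a retweet_count column that falls inside the "..." elisions of the table definition quoted in Part 3:

```sql
-- top 10 most-retweeted users
SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count
FROM (
  SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
         retweeted_status.text,
         max(retweet_count) AS retweets
  FROM tweets
  GROUP BY retweeted_status.user.screen_name, retweeted_status.text
) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
```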
Conclusion
Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume
- by Jon Natkins October 21, 2012
- http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
**This is the second post in the series. Part 1 showed how to tie the CDH components together into one application; this part goes deeper into each component.**
Sources
Sources come in two different flavors:
- event-driven
- pollable
The difference between the two is essentially push versus pull.
- Event-driven sources typically receive events through mechanisms like callbacks or RPC calls. Pollable sources, in contrast, operate by polling for events every so often in a loop.
Examining the TwitterSource
Configuring the Flume Agent
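The agent configuration walked through in the post looks roughly like this trimmed sketch (the OAuth values are placeholders you must fill in yourself, and the keywords are illustrative):

```properties
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# the custom event-driven source from the cdh-twitter-example repo
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
```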
Channels
This example uses the Memory Channel:
TwitterAgent.channels.MemChannel.type = memory
- http://flume.apache.org/FlumeUserGuide.html#flume-channels
- Flume Channels
- Memory Channel
- JDBC Channel
- File Channel
- Pseudo Transaction Channel
- Custom Channel
Sinks
A nice configuration feature: the HDFS path can include time-based escape sequences:
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
The timestamp information comes from a header that TwitterSource adds to each event:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
Starting the Agent
/etc/default/flume-ng-agent contains the environment variable FLUME_AGENT_NAME (set to TwitterAgent for this example).
$ /etc/init.d/flume-ng-agent start
/user/flume/tweets
natty@hadoop1:~/source/cdh-twitter-example$ hadoop fs -ls /user/flume/tweets/2012/09/20/05
Found 2 items
-rw-r--r-- 3 flume hadoop 255070 2012-09-20 05:30 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893253
-rw-r--r-- 3 flume hadoop 538616 2012-09-20 05:39 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893254.tmp
Data is first written to a .tmp file; when the event-count or time threshold is reached, the file is rolled (renamed). The relevant parameters are rollCount and rollInterval.
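For example (the values here are illustrative, not the post's):

```properties
# roll the file after 10,000 events or 600 seconds, whichever comes first
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
```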
Conclusion
Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive
- by Jon Natkins November 13, 2012
- http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
This is the third post in the series. It weighs Hive's strengths and weaknesses, and argues that Hive is the right choice for this tweet-analysis application.
Characterizing Data
- well-structured
- unstructured
- semi-structured
- poly-structured
Complex Data Structures
SELECT array_column[0] FROM foo;
SELECT map_column['map_key'] FROM foo;
SELECT struct_column.struct_field FROM foo;
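For context, here is a hypothetical table foo against which all three queries would be valid (my sketch, not from the post):

```sql
CREATE TABLE foo (
  array_column  ARRAY<STRING>,             -- indexed with [0], [1], ...
  map_column    MAP<STRING, INT>,          -- indexed with ['map_key']
  struct_column STRUCT<struct_field: INT>  -- accessed with dot notation
);
```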
A Table for Tweets
The table design:
CREATE EXTERNAL TABLE tweets (
...
retweeted_status STRUCT<
  text:STRING,
  user:STRUCT<screen_name:STRING, name:STRING>>,
entities STRUCT<
  urls:ARRAY<STRUCT<expanded_url:STRING>>,
  user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
  hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
SELECT entities.user_mentions[0].screen_name FROM tweets;
This is how JSON objects are mapped onto Hive columns.
Serializers and Deserializers
In Hive, SerDe is an abbreviation of Serializer and Deserializer.
Putting It All Together
One Thing to Watch Out For…
If it looks like a duck and quacks like a duck, it must be a duck, right? New Hive users should be careful not to mistake Hive for a relational database.
Conclusion
Source: oschina
Link: https://my.oschina.net/u/1180405/blog/166829