83 Website Clickstream Data Analysis Case Study (Module Development - ETL)

Submitted by 那年仲夏 on 2019-11-27 02:24:59

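This project's data analysis runs on a Hadoop cluster and relies mainly on the Hive data warehouse tool. The collected and preprocessed data therefore has to be loaded into the Hive warehouse before any further mining and analysis.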

1. Create the raw data tables

Create the source-aligned (ODS) raw data table in the Hive warehouse:

drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
valid string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';
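
The table above expects each preprocessed line to carry nine fields separated by the `\001` control character. As a minimal Python sketch of that layout (the sample values and the `split_record` helper are hypothetical, not taken from the project's data), a line maps onto the columns like this:

```python
# Column order copied from the ods_weblog_origin DDL (partition column excluded).
FIELDS = [
    "valid", "remote_addr", "remote_user", "time_local", "request",
    "status", "body_bytes_sent", "http_referer", "http_user_agent",
]

def split_record(line):
    """Split one \\001-delimited line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\x01")
    # Hive itself would fill missing trailing columns with NULL;
    # raising an error here is a choice made for this sketch only.
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# Hypothetical sample line, for illustration only.
sample = "\x01".join([
    "true", "192.168.0.1", "-", "2013-09-18 06:49:18", "/index.html",
    "200", "1024", '"-"', '"Mozilla/5.0"',
])
record = split_record(sample)
```

Every column is declared as `string` in the DDL, so no type conversion happens at load time; the sketch keeps all values as strings for the same reason.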

The pageviews table of the clickstream model:

drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
Session string,
remote_addr string,
time_local string,
request string,
visit_step string,
page_staylong string,
http_referer string,
http_user_agent string,
body_bytes_sent string,
status string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

Create the time dimension table:

drop table if exists dim_time;
create table dim_time(
year string,
month string,
day string,
hour string)
row format delimited
fields terminated by ',';

2. Load the data

Load the cleaned data into the source-aligned table ods_weblog_origin:
load data inpath '/weblog/preprocessed/16-02-24-16/' overwrite into table ods_weblog_origin partition(datestr='2013-09-18');

0: jdbc:hive2://localhost:10000> show partitions ods_weblog_origin;
+-------------------+--+
|     partition     |
+-------------------+--+
| timestr=20151203  |
+-------------------+--+

0: jdbc:hive2://localhost:10000> select count(*) from ods_weblog_origin;
+--------+--+
|  _c0   |
+--------+--+
| 11347  |
+--------+--+

Load the clickstream-model pageviews data into the ods_click_pageviews table:
0: jdbc:hive2://hdp-node-01:10000> load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='2013-09-18');

0: jdbc:hive2://hdp-node-01:10000> select count(1) from ods_click_pageviews;
+------+--+
| _c0  |
+------+--+
| 66   |
+------+--+


Load the clickstream-model visit data into the ods_click_visit table.

Load the time dimension table:
load data inpath '/dim_time.txt' into table dim_time;
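
The dim_time table is comma-delimited with the columns year, month, day, and hour, so `/dim_time.txt` needs one line per hour in that order. The source does not show how the file is produced; the following Python sketch is one possible way, assuming zero-padded values and a hypothetical helper name `dim_time_rows`:

```python
from datetime import datetime, timedelta

def dim_time_rows(start, end):
    """Yield one 'year,month,day,hour' line per hour in [start, end)."""
    current = start
    while current < end:
        # Zero-padded fields are an assumption, not confirmed by the source.
        yield current.strftime("%Y,%m,%d,%H")
        current += timedelta(hours=1)

# One day of hourly rows for the partition date used in this section.
rows = list(dim_time_rows(datetime(2013, 9, 18), datetime(2013, 9, 19)))
```

Writing `"\n".join(rows)` to a file and uploading it to HDFS would give a file that matches the table's `fields terminated by ','` layout.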

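3. Build the ODS-layer detail wide table

Requirement overview

The analysis proceeds layer by layer through the data warehouse. Broadly, intermediate tables are first derived from the raw ODS data: to simplify later analysis, semi-structured values such as the time and URL in the raw data are broken out into structured fields, refining each field into a detail table. The various metrics are then computed on top of these intermediate tables.

ETL implementation

Create the detail table (source: ods_weblog_origin; target: ods_weblog_detail)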

drop table if exists ods_weblog_detail;
create table ods_weblog_detail(
valid           string, -- validity flag
remote_addr     string, -- source IP
remote_user     string, -- user identifier
time_local      string, -- full access timestamp
daystr          string, -- access date
timestr         string, -- access time
month           string, -- access month
day             string, -- access day
hour            string, -- access hour
request         string, -- requested URL
status          string, -- response code
body_bytes_sent string, -- bytes transferred
http_referer    string, -- referrer URL
ref_host        string, -- referrer host
ref_path        string, -- referrer path
ref_query       string, -- referrer query string
ref_query_id    string, -- value of the id parameter in the referrer query
http_user_agent string  -- client user agent
)
partitioned by(datestr string);

-- Extract the referrer URL into the intermediate table t_ods_tmp_referurl
-- Split the referrer URL into host, path, query, and query id

drop table if exists t_ods_tmp_referurl;
create table t_ods_tmp_referurl as
SELECT a.*,b.*
FROM ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id;
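
To make the referrer split concrete, here is a rough Python analogue of what `parse_url_tuple` returns for the four parts requested in the query. It is only an approximation under stated assumptions: the function name `parse_url_tuple_sketch` and the sample URL are hypothetical, Python lowercases the host where Hive may not, and strings that are not valid URLs are assumed to yield NULL (None here) for every part.

```python
import re
from urllib.parse import urlsplit

def parse_url_tuple_sketch(url, *parts):
    """Approximate Hive's parse_url_tuple for HOST, PATH, QUERY and QUERY:<key>."""
    # Same cleanup as regexp_replace(http_referer, "\"", "") in the query.
    cleaned = url.replace('"', "")
    pieces = urlsplit(cleaned)
    if not pieces.scheme or not pieces.netloc:
        # Assumption: an unparsable referrer such as "-" gives NULL for all parts.
        return tuple(None for _ in parts)
    result = []
    for part in parts:
        if part == "HOST":
            value = pieces.hostname
        elif part == "PATH":
            value = pieces.path or None
        elif part == "QUERY":
            value = pieces.query or None
        elif part.startswith("QUERY:"):
            key = re.escape(part[len("QUERY:"):])
            match = re.search(r"(?:&|^)" + key + r"=([^&]*)", pieces.query)
            value = match.group(1) if match else None
        else:
            raise ValueError(f"part not covered by this sketch: {part}")
        result.append(value)
    return tuple(result)

# Hypothetical referrer value, quoted the way the raw field is.
parsed = parse_url_tuple_sketch(
    '"http://www.example.com/list.php?id=42&page=2"',
    "HOST", "PATH", "QUERY", "QUERY:id",
)
empty = parse_url_tuple_sketch('"-"', "HOST", "PATH", "QUERY", "QUERY:id")
```

For the sample URL the four values correspond to the host, path, query, and query_id columns produced by the lateral view.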

-- Extract and convert the time_local field into the intermediate detail table t_ods_tmp_detail
-- Offsets assume time_local looks like 'yyyy-MM-dd HH:mm:ss'; Hive's substring is 1-based.

drop table if exists t_ods_tmp_detail;
create table t_ods_tmp_detail as
select b.*,substring(time_local,0,10) as daystr,
substring(time_local,12) as tmstr,
substring(time_local,6,2) as month,
substring(time_local,9,2) as day,
substring(time_local,12,2) as hour
from t_ods_tmp_referurl b;
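
Because Hive's `substring` counts positions from 1 (a position of 0 behaves like 1), the offsets are easy to get wrong by one. The Python sketch below mimics that indexing so the offsets can be checked against a sample value. It assumes time_local is formatted as `yyyy-MM-dd HH:mm:ss`, which the source does not state explicitly; `hive_substring` is a hypothetical helper, not a Hive API.

```python
def hive_substring(text, pos, length=None):
    """Mimic Hive substring(str, pos[, len]) with 1-based positions."""
    if abs(pos) > len(text):
        return ""
    if pos > 0:
        start = pos - 1          # 1-based to 0-based
    elif pos < 0:
        start = len(text) + pos  # negative positions count from the end
    else:
        start = 0                # position 0 is treated like position 1
    end = len(text) if length is None else min(len(text), start + length)
    return text[start:end]

sample = "2013-09-18 06:49:18"  # assumed time_local format

daystr = hive_substring(sample, 0, 10)
tmstr = hive_substring(sample, 12)
month = hive_substring(sample, 6, 2)
day = hive_substring(sample, 9, 2)
hour = hive_substring(sample, 12, 2)

# Offsets that are one position too low pick up the separator instead.
off_by_one_month = hive_substring(sample, 5, 2)
```

With this format, positions 6, 9, and 12 select the month, day, and hour, while position 5 would return the dash before the month together with its first digit.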

The statements above can be combined into a single statement:

-- $day_01 is a placeholder for the partition date; it must be substituted
-- (for example by a wrapper shell script) before the statement is run.
insert into table zs.ods_weblog_detail partition(datestr='$day_01')
select c.valid,c.remote_addr,c.remote_user,c.time_local,
substring(c.time_local,0,10) as daystr,
substring(c.time_local,12) as tmstr,
substring(c.time_local,6,2) as month,
substring(c.time_local,9,2) as day,
substring(c.time_local,12,2) as hour,
c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
from
(SELECT
a.valid,a.remote_addr,a.remote_user,a.time_local,
a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id
FROM zs.ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id) c;

0: jdbc:hive2://localhost:10000> show partitions ods_weblog_detail;
+---------------------+--+
|      partition      |
+---------------------+--+
| dd=18%2FSep%2F2013  |
+---------------------+--+
1 row selected (0.134 seconds)
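
The combined insert statement uses a `$day_01` placeholder for the partition date, which has to be filled in before the statement reaches Hive. The source does not show how this is done; as one possible sketch, Python's `string.Template` happens to use the same `$name` syntax. Only the statement's first line is used here, and `render_for_day` is a hypothetical helper.

```python
from datetime import datetime
from string import Template

# First line of the combined insert statement, kept short for the sketch.
INSERT_HEADER = Template(
    "insert into table zs.ods_weblog_detail partition(datestr='$day_01')"
)

def render_for_day(template, day):
    """Fill in $day_01 after checking that day parses as a year-month-day date."""
    datetime.strptime(day, "%Y-%m-%d")  # raises ValueError if not a date
    return template.substitute(day_01=day)

rendered = render_for_day(INSERT_HEADER, "2013-09-18")
```

Checking the value before substitution keeps arbitrary text out of the generated statement, which matters when the date comes from a scheduler or a command-line argument.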