Data Warehouse Layering
ODS: Operation Data Store
Raw data.
DWD: Data Warehouse Detail (data cleansing)
Detail-level data with nulls, dirty records, and out-of-range values removed.
Detail parsing.
Concrete event tables.
DWS: Data Warehouse Service (wide tables - user behavior, light aggregation)
Service-layer metrics: retention, conversion, GMV, repurchase rate, daily active users.
Likes, comments, favorites.
Light aggregation on top of DWD (a sketch follows after this overview).
ADS: Application Data Store (report-ready results)
Analysis results are synced to an RDS database.
Data mart: narrowly, the ADS layer; broadly, the DWD/DWS/ADS data synced from Hadoop to RDS.
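To make "light aggregation on top of DWD" concrete, a DWS daily-device wide table could be built roughly as below. This is a minimal sketch, not part of the original walkthrough: the table name dws_uv_detail_day is assumed, and dwd_start_log is assumed to expose mid_id and user_id.
drop table if exists dws_uv_detail_day;
create external table dws_uv_detail_day(
`mid_id` string,
`user_id` string)
partitioned by (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_day';
-- collapse the per-event DWD rows into one row per device per day
insert overwrite table dws_uv_detail_day
partition(dt='2019-02-10')
select
mid_id,
max(user_id) user_id   -- one row per device, so any non-grouped field must be aggregated
from dwd_start_log
where dt='2019-02-10'
group by mid_id;
ADS-layer tables then read from such DWS tables to produce the final report numbers that get exported to RDS.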
Building the warehouse: ODS & DWD
1) Create the gmall database
create database gmall;
Note: if the database already exists and contains data, force-drop it first with: drop database gmall cascade;
2) Use the gmall database
use gmall;
1. ODS layer
Raw data layer: stores the raw data loaded directly from the logs, kept as-is with no processing.
1) Create the start-up log table ods_start_log
Create a partitioned table with LZO input, text output, and JSON-parseable lines:
drop table if exists ods_start_log;
create external table ods_start_log(`line` string)
partitioned by (`dt` string)
stored as
inputformat 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/warehouse/gmall/ods/ods_start_log';
Load the data
Dates are all formatted as yyyy-MM-dd, the date format Hive supports by default.
load data inpath '/origin_data/gmall/log/topic_start/2019-02-10' into table gmall.ods_start_log
partition(dt="2019-02-10");
select * from ods_start_log limit 2;
2) Create the event log table ods_event_log
Create a partitioned table with LZO input, text output, and JSON-parseable lines:
drop table if exists ods_event_log;
create external table ods_event_log(`line` string)
partitioned by (`dt` string)
stored as
inputformat 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/warehouse/gmall/ods/ods_event_log';
Load the data
load data inpath '/origin_data/gmall/log/topic_event/2019-02-10' into table gmall.ods_event_log
partition (dt="2019-02-10");
select * from ods_event_log limit 2;
ODS layer data-loading script
Create the script under /home/kris/bin on hadoop101:
vim ods_log.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ]; then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
echo "=== log date: $do_date ==="
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table "$APP".ods_start_log partition(dt = '$do_date');
load data inpath '/origin_data/gmall/log/topic_event/$do_date' into table "$APP".ods_event_log partition(dt = '$do_date');
"
$hive -e "$sql"
Grant execute permission to the script: chmod 777 ods_log.sh
Script usage: ods_log.sh 2019-02-11
Check the imported data
select * from ods_start_log where dt='2019-02-10' limit 2;
select * from ods_event_log where dt='2019-02-10' limit 2;
Script schedule: in production this typically runs daily between about 00:30 and 01:00.
2. DWD data parsing
Clean the ODS data: remove nulls, dirty records, and out-of-range values; convert row-oriented storage to columnar storage; change the compression format.
DWD parsing is an intermediate step built on two base tables: dwd_base_start_log and dwd_base_event_log.
On top of them, 12 external event tables are built, partitioned by date; starting from dwd_base_event_log, the fields inside event_json are parsed out one by one with get_json_object, keyed by event_name.
Create the DWD base detail tables
The detail tables store the detail data converted from the ODS raw tables.
Create the start-up log base detail table:
drop table if exists dwd_base_start_log;
create external table dwd_base_start_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`event_name` string,
`event_json` string,
`event_time` string)
partitioned by (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_start_log';
Here event_name and event_json hold the event name and the full event payload. This is where the one-to-many structure of the raw log gets split apart: to flatten the raw data we need a custom UDF and UDTF.
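To see what one-to-many flattening with a lateral view looks like, here is a toy example using Hive's built-in explode; the project's flat_analizer UDTF plays the same role for the et JSON array, and the literal values below are made up purely for illustration.
select t.id, tmp.ev
from (select 1 as id, 'ad,loading,notification' as evs) t
lateral view explode(split(t.evs, ',')) tmp as ev;
-- one input row becomes three output rows: (1, ad), (1, loading), (1, notification)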
Create the event log base detail table
drop table if exists dwd_base_event_log;
create external table dwd_base_event_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`event_name` string,
`event_json` string,
`event_time` string)
partitioned by (`dt` string)
stored as parquet
location"/warehouse/gmall/dwd/dwd_event_start_log";
Custom UDF (parses the common fields)
UDF: parses the common fields, the event array et (a JSON array), and the server timestamp.
Custom UDTF (parses the event-specific fields); its process() is one-in-many-out (many-in-many-out is also possible).
UDTF: for the incoming event array et (a JSON array), returns one event_name | event_json row per concrete event inside et.
Parsing the base detail tables
Add the jar to Hive's classpath
Create temporary functions bound to the compiled Java classes
add jar /opt/module/hive/hviefuction-1.0-SNAPSHOT.jar;
create temporary function base_analizer as "com.atguigu.udf.BaseFieldUDF";
create temporary function flat_analizer as "com.atguigu.udtf.EventJsonUDTF";
set hive.exec.dynamic.partition.mode=nonstrict;
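Before running the full inserts, it can be worth a quick sanity check that the functions are registered and parse a real line. This is a minimal sketch using the function and field list from the insert statements below; the exact output layout depends on the UDF implementation.
-- expected: one tab-separated string of parsed common fields plus the et array and server time
select base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')
from ods_start_log
where dt='2019-02-10'
limit 1;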
Parse the start-up log base detail table:
insert overwrite table dwd_base_start_log
partition(dt)
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,event_name,event_json,server_time,dt from(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time,
dt
from ods_start_log where dt='2019-02-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
)sdk_log lateral view flat_analizer(ops) tmp_k as event_name,event_json;
Parse the event log base detail table
insert overwrite table dwd_base_event_log
partition(dt='2019-02-10')
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,event_name,event_json,server_time from(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from ods_event_log where dt='2019-02-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name,event_json;
Test
select * from dwd_base_event_log limit 2;
DWD base table loading script
Create the script under /home/kris/bin on hadoop101:
vim dwd_base.sh
#! /bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
if [ -n "$1" ]; then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
add jar /opt/module/hive/hviefuction-1.0-SNAPSHOT.jar;
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_base_start_log
partition(dt)
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,event_name,event_json,server_time,dt from(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time,
dt
from "$APP".ods_start_log where dt='$do_date' and base_analizer(line,'mid,uid,vc,vn,se,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
)sdk_log lateral view flat_analizer(ops) tmp_k as event_name,event_json;
insert overwrite table "$APP".dwd_base_event_log
partition(dt='$do_date')
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,event_name,event_json,server_time from(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from "$APP".ods_event_log where dt='$do_date' and base_analizer(line,'mid,uid,vc,vn,se,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name,event_json;
"
$hive -e "$sql"
chmod +x dwd_base.sh
dwd_base.sh 2019-02-11
Check the imported results
select * from dwd_base_start_log where dt='2019-02-11' limit 2;
select * from dwd_base_event_log where dt='2019-02-11' limit 2;
Script schedule
In production this typically runs daily between about 00:30 and 01:00.
3. DWD layer (event tables)
1) Product click table
Create the table
drop table if exists dwd_display_log;
create external table dwd_display_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`action` string,
`newsid` string,
`place` string,
`extend1` string,
`category` string,
`server_time` string
)
partitioned by (dt string)
location '/warehouse/gmall/dwd/dwd_display_log/';
Load the data
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_display_log
partition(dt)
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,
get_json_object(event_json,'$.kv.action') action ,
get_json_object(event_json,'$.kv.newsid') newsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
event_time,
dt
from dwd_base_event_log
where dt = '2019-02-10' and event_name = 'display';
Test
select * from dwd_display_log limit 2;
2) Product detail page table
3) Product list page table
4) Ad table
5) Notification table
6) User foreground-active table
7) User background-active table
8) Comment table
9) Favorites table
10) Like table
11) Start-up log table
12) Error log table
Each follows the same recipe as the product click table: keep the common fields and parse the event-specific fields out of event_json with get_json_object, filtering on the matching event_name (see the sketch below).
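As a hedged sketch of that recipe, the comment table might look roughly like this. The common fields and the get_json_object pattern come from the product click table above; the event-specific fields (comment_id, content, addtime, other_id, praise_count) and the event_name value 'comment' are assumptions, not taken from this walkthrough.
drop table if exists dwd_comment_log;
create external table dwd_comment_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`comment_id` string,
`content` string,
`addtime` string,
`other_id` string,
`praise_count` string,
`server_time` string
)
partitioned by (dt string)
location '/warehouse/gmall/dwd/dwd_comment_log/';
-- assumes hive.exec.dynamic.partition.mode=nonstrict, as set for the product click table
insert overwrite table dwd_comment_log
partition(dt)
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
event_time,
dt
from dwd_base_event_log
where dt = '2019-02-10' and event_name = 'comment';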
DWD event-table loading script
Create the script under /home/kris/bin on hadoop101:
vim dwd.sh
#! /bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
if [ -n "$1" ]; then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_display_log
partition(dt)
select
mid_id,user_id,version_code,version_name,lang,source,os,area,model,brand,sdk_version,gmail,height_width,app_time,network,lng,lat,
get_json_object(event_json,'$.kv.action') action ,
get_json_object(event_json,'$.kv.newsid') newsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
event_time,
dt
from "$APP".dwd_base_event_log
where dt = '$do_date' and event_name = 'display';
"
$hive -e "$sql"
Grant execute permission to the script
chmod 777 dwd.sh
Script usage
dwd.sh 2019-02-11
Check the imported results
select * from dwd_start_log where dt = '2019-02-11' limit 2;
select * from dwd_comment_log where dt = '2019-02-11' limit 2;
Script schedule
In production this typically runs daily between about 00:30 and 01:00.
Data retention on HDFS: clean up roughly every six months to a year; old data can be downloaded, compressed, and archived first.
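For the cleanup itself, one common approach is to drop partitions older than the retention window once they have been archived. A minimal sketch; the cutoff date here is arbitrary, and since these are external tables the underlying HDFS directories must be removed separately (e.g. with hadoop fs -rm -r).
alter table ods_start_log drop if exists partition (dt='2018-08-10');
alter table ods_event_log drop if exists partition (dt='2018-08-10');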
Source: oschina
Link: https://my.oschina.net/u/4490092/blog/4298530