I. Data Warehouse
A data warehouse is a subject-oriented, integrated, non-volatile collection of data used to support decision analysis in an enterprise or organization.
Data sources: the foundation of the data warehouse system
Data storage and management: the core of the entire data warehouse system
OLAP server: integrates the data needed for analysis and organizes it in a multidimensional model, so that it can be analyzed from multiple angles and at multiple levels and trends can be discovered.
Front end: reporting tools, query tools, data analysis tools, data mining tools, and so on
II. Hive
1. Overview
Originally open-sourced by Facebook to run statistics over massive volumes of structured log data, it later became the Apache Hive project. Hive is a data warehouse tool built on Hadoop: it maps structured data files onto tables and provides an SQL-like query language.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
2. Hive Characteristics
A data warehouse built on top of Hadoop; data is stored in Hadoop HDFS
Uses HQL as the query interface
Uses HDFS for storage
Uses MapReduce for computation
In essence, it translates HQL into MapReduce jobs
A Hive table is simply a directory and its files on HDFS
There is no dedicated storage format; plain text files (TextFile) load directly by default, and SequenceFile and RCFile are also supported
Good flexibility and extensibility: UDFs, custom storage formats, and more
Well suited to offline data processing
The storage structures include databases, files, tables, views, and indexes
When creating a table, specify the column delimiter and row delimiter and Hive can then parse the data
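As a rough illustration of the last point, parsing delimited text into table rows works roughly like this (a hypothetical Python sketch; Hive's real deserialization is done by SerDes, and the function name here is invented):

```python
# Toy illustration of how a delimited text file maps to table rows.
# Hive's actual TextFile reading goes through a SerDe; this only mimics
# the idea: split on a row delimiter, then on a field delimiter.

def parse_rows(text, field_delim=",", row_delim="\n"):
    """Split raw text into rows of fields, like a minimal TextFile reader."""
    rows = []
    for line in text.strip().split(row_delim):
        rows.append(line.split(field_delim))
    return rows

data = "1,Alice\n2,Bob"
print(parse_rows(data))  # [['1', 'Alice'], ['2', 'Bob']]
```

Changing `field_delim` and `row_delim` corresponds to the delimiters declared in the CREATE TABLE statement.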
3. Position in the Ecosystem
4. Hive Architecture
User interfaces: the CLI (hive shell), JDBC/ODBC (access from Java), and the web UI (access from a browser)
Metastore: the metadata covers table names, the database each table belongs to (default by default), the table owner, column and partition fields, the table type (internal or external), the directory holding the table's data, and so on. By default it is stored in the bundled Derby database; MySQL is recommended instead. (Derby cannot run multiple instances: open the Hive CLI in several terminal windows and the second window fails with an error. So with Derby, only one client can connect at a time and every other client is locked out, which rules out future multi-client use.)
Hadoop: HDFS for storage, MapReduce for computation
Driver: consists of a parser, a compiler, an optimizer, and an executor.
Parser: turns the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party library such as ANTLR. The AST is then checked semantically: does the table exist, do the columns exist, is the SQL valid (for example, does every non-aggregated column in the SELECT list also appear in the GROUP BY)?
Compiler: compiles the AST into a logical execution plan
Optimizer: optimizes the logical execution plan
Executor: converts the logical plan into a runnable physical plan; for Hive that means MR, Tez, or Spark jobs
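The four driver stages can be caricatured in a few lines of Python (a deliberately tiny sketch with invented names, handling only one fixed query shape; real Hive uses ANTLR and a far richer operator tree):

```python
# A toy four-stage "driver" for queries of the shape: SELECT <cols> FROM <table>
# The stage names mirror Hive's driver; everything else is an assumption.

def parse(sql):
    # Parser: SQL string -> tiny "AST" (here just a dict).
    tokens = sql.replace(",", " ").split()
    upper = [t.upper() for t in tokens]
    assert upper[0] == "SELECT" and "FROM" in upper
    i = upper.index("FROM")
    return {"select": tokens[1:i], "from": tokens[i + 1]}

def compile_plan(ast):
    # Compiler: AST -> logical plan (an ordered list of operators).
    return [("scan", ast["from"]), ("project", ast["select"])]

def optimize(plan):
    # Optimizer: a no-op here; real Hive rewrites the operator tree.
    return plan

def execute(plan, tables):
    # Executor: run the plan over in-memory "tables" instead of MR/Tez/Spark.
    rows = None
    for op, arg in plan:
        if op == "scan":
            rows = tables[arg]
        elif op == "project":
            rows = [{c: r[c] for c in arg} for r in rows]
    return rows

tables = {"emp": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}
plan = optimize(compile_plan(parse("SELECT name FROM emp")))
print(execute(plan, tables))  # [{'name': 'Alice'}, {'name': 'Bob'}]
```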
5. Hive Compared with MapReduce
Maps structured data files onto tables
Translates SQL into MapReduce jobs under the hood, sparing developers tedious MR programming
Storage on HDFS, computation on MR
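To make "HQL compiles to MapReduce" concrete, here is a hypothetical Python miniature of what a query like SELECT word, count(*) FROM docs GROUP BY word boils down to: a map phase, a shuffle grouping by key, and a reduce phase. It illustrates the model only, not the job Hive actually generates:

```python
from collections import defaultdict

# Map phase: emit (word, 1) for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group values by key, as the MR framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts per word, i.e. count(*) per GROUP BY key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hive maps hql", "hql to mapreduce to hive"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'hive': 2, 'maps': 1, 'hql': 2, 'to': 2, 'mapreduce': 1}
```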
6. Hive Strengths and Use Cases
Strengths
The interface uses SQL-like syntax, enabling rapid development (simple, easy to pick up)
No need to write MapReduce by hand, which lowers developers' learning cost
Unified metadata management; the metastore can be shared with Impala, Spark, and others
Easy to scale (HDFS + MapReduce: the cluster can be grown; user-defined functions are supported)
Offline data processing, for example log analysis and batch analysis of massive structured data
Use cases
Hive has high execution latency; its strength is large datasets, and it offers no advantage on small ones. It is therefore typically used for data analysis where real-time response is not required.
7. Environment Setup
7.1 Preparation
Install and configure the JDK
Download CDH Hadoop, CDH Hive, and the MySQL packages
Version selection: download link
hadoop-2.6.0-cdh5.14.2.tar
hive-1.1.0-cdh5.14.2.tar.gz
CDH: Cloudera's Hadoop distribution
Hadoop's popularity has led many companies to ship their own Hadoop versions or to build products around it; in the Hadoop ecosystem, the largest and best-known such company is Cloudera.
MySQL version (online installation is also an option)
mysql-5.7.22-1.el6.x86_64.rpm-bundle.tar
[tzhang@elife opt]$ mkdir softwares
mkdir: cannot create directory `softwares': Permission denied
# To avoid repeated permission-denied errors in the steps that follow, change the
# ownership as below; this is not recommended in real production work
[tzhang@elife opt]$ sudo chown -R tzhang:tzhang /opt/
# From here on, the tzhang user can work under /opt without restriction
7.2 Environment Configuration
7.2.1 Java Configuration
[tzhang@elife jdk1.8.0_144]$ rpm -qa | grep java
[tzhang@elife jdk1.8.0_144]$ sudo rpm -e --nodeps   # uninstall any packages found; skip if there are none
[tzhang@elife jdk1.8.0_144]$ tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/install/
[tzhang@elife jdk1.8.0_144]$ sudo vi /etc/profile   # edit the profile
export JAVA_HOME=/opt/install/jdk1.8.0_144   # append to the end of the file: the JDK install (tar) dir
export PATH=$PATH:$JAVA_HOME/bin
[tzhang@elife jdk1.8.0_144]$ source /etc/profile
Verify with java, javac, and java -version
7.2.2 CDH Hadoop Configuration
The configuration files are under etc/hadoop/
[tzhang@elife softwares]$ tar -zxvf hadoop-2.6.0-cdh5.14.2.tar.gz -C /opt/modules   # training environment
hadoop-env.sh
export JAVA_HOME=/opt/install/jdk1.8.0_144
export HADOOP_PREFIX=/opt/install/hadoop-2.6.0-cdh5.14.2
export HADOOP_CONF_DIR=/opt/install/hadoop-2.6.0-cdh5.14.2/etc/hadoop
core-site.xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://elife.com:8020</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/install/hadoop-2.6.0-cdh5.14.2/data/tmp</value>
</property>
hdfs-site.xml (a single node is enough for learning)
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
slaves
elife.com   # the corresponding hostname
yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>elife.com</value>
</property>
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
</property>
mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
yarn-env.sh
export JAVA_HOME=/opt/install/jdk1.8.0_144
mapred-env.sh
export JAVA_HOME=/opt/install/jdk1.8.0_144
7.2.3 Passwordless SSH
Disable the firewall:
sudo service iptables status
sudo chkconfig iptables off   # disable the firewall permanently
sudo chkconfig iptables --list
[tzhang@elife ~]$ ssh localhost        # asks for a password
[tzhang@elife ~]$ ssh-keygen -t rsa    # type nothing, just press Enter at every prompt
[tzhang@elife ~]$ ssh-copy-id elife.com   # the corresponding hostname
[tzhang@elife ~]$ ssh localhost        # no password required anymore
7.2.4 Hadoop Test
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ bin/hdfs namenode -format
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ sbin/start-dfs.sh
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ jps
3824 NameNode
4112 SecondaryNameNode
4226 Jps
3907 DataNode
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ sbin/start-yarn.sh
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ jps
3824 NameNode
4112 SecondaryNameNode
3907 DataNode
4390 Jps
4361 NodeManager
4271 ResourceManager
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/install/hadoop-2.6.0-cdh5.14.2/logs/mapred-tzhang-historyserver-elife.com.out
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ jps
3824 NameNode
4752 Jps
4112 SecondaryNameNode
3907 DataNode
4723 JobHistoryServer
4361 NodeManager
4271 ResourceManager
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ bin/hdfs dfs -mkdir /user
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ bin/hdfs dfs -mkdir /user/tzhang   # replace tzhang with your system username
All the processes start normally.
If the following appears in the logs now or later, ignore it:
18/06/07 22:13:34 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1252)
    at java.lang.Thread.join(Thread.java:1326)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:950)
7.2.5 Hive Configuration
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ bin/hdfs dfs -mkdir -p /user/hive/warehouse
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ bin/hdfs dfs -chmod g+w /tmp   # create it first if it does not exist yet
[tzhang@elife hadoop-2.6.0-cdh5.14.2]$ bin/hdfs dfs -chmod g+w /user/hive/warehouse
# unpack
[tzhang@elife hive-1.1.0-cdh5.14.2]$ tar -zxvf hive-1.1.0-cdh5.14.2.tar.gz -C /opt/install/
Edit the configuration:
[tzhang@elife conf]$ mv hive-env.sh.template hive-env.sh
HADOOP_HOME=/opt/install/hadoop-2.6.0-cdh5.14.2
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/opt/install/hive-1.1.0-cdh5.14.2/conf
[tzhang@elife hive-1.1.0-cdh5.14.2]$ bin/hive   # from the Hive root directory
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/install/jdk1.8.0_144/bin:/home/tzhang/bin)
18/06/11 05:42:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in jar:file:/opt/install/hive-1.1.0-cdh5.14.2/lib/hive-common-1.1.0-cdh5.14.2.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
default
Time taken: 72.484 seconds, Fetched: 1 row(s)
Logging initialized using configuration in jar:file:/opt/install/hive-1.1.0-cdh5.14.2/lib/hive-common-1.1.0-cdh5.14.2.jar!/hive-log4j.properties
Configure Hive to load a concrete hive-log4j.properties file:
[tzhang@elife conf]$ mv hive-log4j.properties.template hive-log4j.properties
Run again:
[tzhang@elife hive-1.1.0-cdh5.14.2]$ bin/hive
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/install/jdk1.8.0_144/bin:/home/tzhang/bin)
18/06/11 05:56:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in file:/opt/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties   # note that the file loaded on this line has changed
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
Without closing this window, clone a new terminal and run:
[tzhang@elife hive-1.1.0-cdh5.14.2]$ bin/hive
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/install/jdk1.8.0_144/bin:/home/tzhang/bin)
18/06/11 05:48:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in jar:file:/opt/install/hive-1.1.0-cdh5.14.2/lib/hive-common-1.1.0-cdh5.14.2.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;   # fails here
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Cause:
Derby cannot run multiple instances, so with the bundled database only one client can connect at a time and every other client is locked out. Replacing Hive's bundled Derby with MySQL is therefore recommended so that multiple clients are supported.
7.2.6 Installation Modes
Embedded mode
Metadata is kept in the bundled Derby database
Only one connection is allowed
Mostly used for demos
Local mode (used in this course)
Metadata is kept in a MySQL database
The metastore runs on the same physical machine as Hive
Mostly used for development and testing
Remote mode
Metadata is kept in a MySQL database
The metastore runs on a different physical machine from Hive
Used in real production environments
7.2.7 MySQL Installation
7.2.7.1 Remove the Bundled MySQL
[tzhang@elife softwares]$ rpm -qa|grep mysql
mysql-libs-5.1.66-2.el6_3.x86_64
[tzhang@elife softwares]$ sudo rpm -e --nodeps mysql-libs-5.1.66-2.el6_3.x86_64
[sudo] password for tzhang:
[tzhang@elife softwares]$ rpm -qa|grep mysql
[tzhang@elife softwares]$
7.2.7.2 Unpack the tar File
[tzhang@elife softwares]$ tar -xvf mysql-5.7.22-1.el6.x86_64.rpm-bundle.tar -C /opt/softwares/
7.2.7.3 Install the Required rpm Files
[tzhang@elife softwares]$ sudo rpm -ivh mysql-community-common-5.7.22-1.el6.x86_64.rpm
[tzhang@elife softwares]$ sudo rpm -ivh mysql-community-libs-5.7.22-1.el6.x86_64.rpm
[tzhang@elife softwares]$ sudo rpm -ivh mysql-community-client-5.7.22-1.el6.x86_64.rpm
[tzhang@elife softwares]$ sudo rpm -ivh mysql-community-server-5.7.22-1.el6.x86_64.rpm
[tzhang@elife softwares]$ sudo rpm -ivh mysql-community-devel-5.7.22-1.el6.x86_64.rpm
After installation:
[tzhang@elife softwares]$ sudo service mysqld start    # if this fails, try starting it again
[tzhang@elife softwares]$ sudo service mysqld status   # check the status
[tzhang@elife softwares]$ sudo grep 'temporary password' /var/log/mysqld.log
# On first start MySQL generates a random initial password and writes it to
# /var/log/mysqld.log; the command above prints a line like:
# A temporary password is generated for root@localhost: uu>q%=qqq0A&
7.2.7.4 Log In and Change the Password
[tzhang@elife softwares]$ mysql -uroot -p
Enter password:    # enter the temporary password; the input is hidden
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.22

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
ERROR 1820 (HY000): You must reset your password using ALTER USER statement before executing this statement.
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'Abc1234!';
Query OK, 0 rows affected (0.00 sec)
# This changes the password. MySQL's password-validation plugin is installed by
# default, so use at least 8 characters with at least one upper-case letter, one
# lower-case letter, one digit, and one special character.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.00 sec)
mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_mysql           |
+---------------------------+
| columns_priv              |
| db                        |
| engine_cost               |
7.2.7.5 Make MySQL Accessible to External Users
In versions before 5.7, the password column of the user table was named password; 5.7 renamed it to authentication_string.
mysql> select host,user,authentication_string from user;
+-----------+---------------+-------------------------------------------+
| host      | user          | authentication_string                     |
+-----------+---------------+-------------------------------------------+
| localhost | root          | *7D4D4010B468CFA62D040406F785A20571D7E323 |
| localhost | mysql.session | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | mysql.sys     | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
+-----------+---------------+-------------------------------------------+
3 rows in set (0.00 sec)
update user set host='%' where user='root';
mysql> select host,user,authentication_string from user;
+-----------+---------------+-------------------------------------------+
| host      | user          | authentication_string                     |
+-----------+---------------+-------------------------------------------+
| %         | root          | *7D4D4010B468CFA62D040406F785A20571D7E323 |
| localhost | mysql.session | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | mysql.sys     | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
+-----------+---------------+-------------------------------------------+
3 rows in set (0.00 sec)
mysql> grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
Query OK, 0 rows affected, 1 warning (0.00 sec)   # *.* means every database and every table
mysql> flush privileges;   # flush so the changed privileges take effect
Query OK, 0 rows affected (0.00 sec)
7.2.7.6 Start at Boot
[tzhang@elife softwares]$ sudo chkconfig mysqld on
[sudo] password for tzhang:
[tzhang@elife softwares]$ sudo chkconfig mysqld --list
mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off
[tzhang@elife softwares]$
7.3 Metastore Configuration
With MySQL installed, configure Hive to store its metastore in MySQL.
7.3.1 Create hive-site.xml under Hive's conf Directory (reference link)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://elife.com:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>Username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
Note
com.mysql.jdbc.Driver is the driver class in mysql-connector-java 5; com.mysql.cj.jdbc.Driver is the driver class in mysql-connector-java 6.
7.3.2 Copy the MySQL Driver jar
[tzhang@elife hive-1.1.0-cdh5.14.2]$ cp /opt/softwares/mysql-connector-java-5.1.46.jar ./lib
7.3.3 Run bin/hive Again
[tzhang@elife hive-1.1.0-cdh5.14.2]$ bin/hive
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/install/jdk1.8.0_144/bin:/home/tzhang/bin)
18/06/11 09:11:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in file:/opt/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
default
Time taken: 42.713 seconds, Fetched: 1 row(s)
hive>
In MySQL, before and after:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.01 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cdhmetastore       |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)
If the following warning is reported during the steps above:
WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Fix: change the tail of the connection URL to
createDatabaseIfNotExist=true&amp;useSSL=false   # &amp; is the XML escape for &
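The escape is needed only because hive-site.xml is an XML file, where a bare & is invalid. A quick way to check such escaping (a side illustration in Python, unrelated to Hive itself):

```python
from xml.sax.saxutils import escape, unescape

# The raw JDBC URL as Hive will see it after XML parsing.
url = "jdbc:mysql://elife.com:3306/metastore?createDatabaseIfNotExist=true&useSSL=false"

xml_value = escape(url)   # what must appear inside <value>...</value>
print(xml_value.endswith("createDatabaseIfNotExist=true&amp;useSSL=false"))  # True
print(unescape(xml_value) == url)  # True: the XML parser recovers the raw URL
```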
7.4 Hive Logs
As noted earlier, logging is configured in hive-log4j.properties under Hive's conf directory. Opening that file shows settings such as:
hive.root.logger=WARN,DRFA                     # WARN is the log level
hive.log.dir=${java.io.tmpdir}/${user.name}    # resolves to the /tmp/tzhang directory
hive.log.file=hive.log                         # log file name
These can be changed, for example:
[tzhang@elife hive-1.1.0-cdh5.14.2]$ mkdir logs
[tzhang@elife hive-1.1.0-cdh5.14.2]$ cd logs
[tzhang@elife logs]$ pwd
/opt/install/hive-1.1.0-cdh5.14.2/logs   # point hive.log.dir at this path
7.5 Basic Operations
7.5.1 The Hive Command Line
Open two command-line sessions again and check that the earlier error is gone:
hive> create table uu(id int,name string);
After creating the table, look at the warehouse directory in the web UI: each table name corresponds to a directory.
7.5.2 Interacting with Linux
hive> !ls /opt/install;
hadoop-2.6.0-cdh5.14.2
hive-1.1.0-cdh5.14.2
jdk1.8.0_144
7.5.3 Interacting with HDFS
hive> dfs -ls /user/hive;
Found 1 items
drwxrwxr-x   - tzhang supergroup          0 2018-06-11 09:49 /user/hive/warehouse
These commands run faster than bin/hdfs.
Why is a dfs command inside Hive faster than bin/hdfs? The Hive CLI reuses its already-running JVM and open FileSystem client, whereas every bin/hdfs invocation has to start a fresh JVM first.
7.5.4 Hive Script Mode
[tzhang@elife hive-1.1.0-cdh5.14.2]$ bin/hive -help
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the console)
--database   specify the database to use on entry
-e <query>   SQL on the command line; single and double quotes both work: -e "show databases" or -e "show databases;use ... ;show tables"
             -e "show databases;use ... ;show tables" >> hive.txt appends the output to hive.txt in the current directory; > overwrites the file instead
-f <file>    run the SQL statements in a file: bin/hive -f /opt/datas/hive.sql
7.5.5 The Hive Java API
Interactive execution through the Java API.
Start Hive's remote service (listening on port 10000):
hive --service hiveserver2
Then establish a connection from Java code through Hive's JDBC driver.
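The JDBC URL such a client connects with has the form jdbc:hive2://host:port/database. A tiny helper (a hypothetical sketch in Python, kept consistent with the other examples here; the function name is invented) shows the pieces:

```python
# Build the JDBC URL a HiveServer2 client (e.g. Java code using the Hive
# JDBC driver, or Beeline) would connect with. Defaults match the text:
# HiveServer2 listens on port 10000, and "default" is Hive's default database.

def hive2_jdbc_url(host, port=10000, database="default"):
    return f"jdbc:hive2://{host}:{port}/{database}"

print(hive2_jdbc_url("elife.com"))  # jdbc:hive2://elife.com:10000/default
```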
7.5.6 The SET Command
hive> set hive.cli.print.current.db=true;
hive (default)>
Effective only within the current session.
To make it permanent (showing the current database and column headers), configure hive-site.xml:
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
</property>
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
</property>