ELK日志分析平台（一）ELK简介、ElasticSearch集群

* ELK简介：

ELK是Elasticsearch 、 Logstash、Kibana三个开源软件的缩写。ELK Stack 5.0版本之后新增Beats工具，因此，ELK Stack也改名为Elastic Stack，Beats是一个轻量级的日志收集处理工具(Agent)，Beats占用资源少，适合于在各个服务器上搜集日志后传输给Logstash，也是官方推荐的工具。

Elastic Stack包含：

Elasticsearch：是一个基于 Lucene 构建的开源分布式搜索引擎，提供搜集、分析、存储数据三大功能。特点：分布式，零配置，自动发现，索引自动分片，索引副本机制，restful风格接口，多数据源，自动搜索负载等，作为 ELK 的核心，它集中存储数据。
Logstash：主要是日志的搜集、分析、过滤日志的工具，支持以 TCP/UDP/HTTP 多种方式收集数据（也可以接受 Beats 传输来的数据），并对数据做进一步丰富或提取字段处理。一般工作方式为c/s架构，client端安装在需要收集日志的主机上，server端负责将收到的各节点日志进行过滤、修改等操作，再一并发往elasticsearch。
Kibana：是一个开源的分析和可视化工具，可以为 Logstash 和 ElasticSearch 提供日志分析友好的 Web 界面，可以帮助汇总、分析和搜索重要数据日志。
Beats：轻量级日志采集器，早期的ELK架构中使用Logstash收集、解析日志，但是Logstash对内存、cpu、io等资源消耗比较高。相比 Logstash，Beats所占系统的CPU和内存几乎可以忽略不计，目前Beats包含六种工具：
Packetbeat：网络数据（收集网络流量数据）
Metricbeat：指标（收集系统、进程和文件系统级别的 CPU 和内存使用情况等数据）
Filebeat：日志文件（收集文件数据）
Winlogbeat： windows事件日志（收集 Windows 事件日志数据）
Auditbeat：审计数据（收集审计日志）
Heartbeat：运行时间监控（收集系统运行时的数据）

x-pack工具

x-pack对Elastic Stack提供了安全、警报、监控、报表、图表于一身的扩展包，是收费的。

Elastic Stack官网：https://www.elastic.co/cn/

ELK 架构

ELK日志分析平台（一）ELK简介、ElasticSearch集群

安装 ELK 时，各应用最好选择统一的版本，避免出现一些莫名其妙的问题。例如：由于版本不统一，导致三个应用间的通讯异常。

ElasticSearch集群

实验环境：

系统：

[root@localhost soft]# uname -r
3.10.0-862.el7.x86_64
[root@localhost soft]#

IP：10.15.97.136-138

* java环境安装

java环境必须是1.8版本以上

[root@localhost soft]# rpm -ivh jdk-8u161-linux-x64.rpm

* ElasticSearch安装

yum安装

[root@localhost soft]# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
[root@localhost soft]# cat /etc/yum.repos.d/elastic.repo
[elasticsearch-6.x]
name=Elasticsearch repository for 6.x packages
baseurl=https://artifacts.elastic.co/packages/6.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
[root@localhost soft]# yum install -y elasticsearch

手动安装

[root@localhost soft]# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.1.tar.gz
[root@localhost soft]# tar -zxvf elasticsearch-6.5.1.tar.gz -C /app/
[root@localhost soft]# cd /app/elasticsearch-6.5.1/
[root@localhost elasticsearch-6.5.1]# mkdir data                      //存放data数据的目录
[root@localhost elasticsearch-6.5.1]# useradd elasticsearch
[root@localhost elasticsearch-6.5.1]# echo "elasticsearch"|passwd --stdin elasticsearch

修改elasticsearch的配置文件

elasticsearch的config文件夹里面有两个配置文件：elasticsearch.yml（es的基本配置文件）和logging.yml（日志配置文件，es也是使用log4j来记录日志）

[root@localhost elasticsearch-6.5.1]# cd config/
[root@localhost config]# vim elasticsearch.yml 
# ---------------------------------- Cluster -----------------------------------
cluster.name: ES-Justin
# ------------------------------------ Node ------------------------------------
node.name: ES-Justin-master
# ----------------------------------- Paths ------------------------------------
path.data: /app/elasticsearch-6.5.1/data
path.logs: /app/elasticsearch-6.5.1/logs
# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: false
# ---------------------------------- Network -----------------------------------
network.host: 10.15.97.136
http.port: 9200
transport.tcp.port: 9300
# --------------------------------- Discovery ----------------------------------
discovery.zen.ping.unicast.hosts: ["10.15.97.136", "10.15.97.137","10.15.97.138"]
discovery.zen.minimum_master_nodes: 2
# ---------------------------------- Various -----------------------------------
#action.destructive_requires_name: true
discovery.zen.commit_timeout: 100s  #以下是集群参数调整
discovery.zen.publish_timeout: 100s
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_retries: 10
action.destructive_requires_name: true #集群安全设置

http.cors.enabled: true
http.cors.allow-origin: "*"
[root@localhost config]# chown -R elasticsearch:elasticsearch /app/elasticsearch-6.5.1/

将以上/app/elasticsearch-6.5.1文件夹copy到另外2台机器。

elasticsearch.yml说明：

    cluster.name:ES-Justin      #es的集群名称，默认是elasticsearch，es会自动发现在同一网段下的es，如果在同一网段下有多个集群，就可以用这个属性来区分不同的集群。
    node.name:”ES-Justin-master”            #节点名称，默认随机指定一个name列表中名字，该列表在es的jar包中config文件夹里name.txt文件中。
    node.master:true                #指定该节点是否有资格被选举成为node，默认是true，es是默认集群中的第一台机器为master，如果这台机挂了就会重新选举master。
    node.data:true                  #指定该节点是否存储索引数据，默认为true。
    node.ingest: true

node.master、node.data、node.ingest有4种组合：
node.master: true
node.data: true
node.ingest: true
elasticsearch默认配置，表示这个节点既有成为主节点的资格，又可以存储数据，还可以作为预处理节点，这个时候如果某个节点被选举成为了真正的主节点，那么他还要存储数据，这样相当于主节点和数据节点的角色混合到一块了，节点的压力就比较大了。
node.master: false
node.data: true
node.ingest: false
表示这个节点没有成为主节点的资格，也就不参与选举，只会存储数据。这个节点称为 data(数据)节点。在集群中需要单独设置几个这样的节点负责存储数据。后期提供存储和查询服务
node.master: true
node.data: false
node.ingest: false
表示这个节点不会存储数据，有成为主节点的资格，可以参与选举，有可能成为真正的主节点。这个节点称为master节点
node.master: false
node.data: false
node.ingest: true
表示这个节点即不会成为主节点，也不会存储数据，这个节点的意义是作为一个 client(客户端)节点，主要是针对海量请求的时候可以进行负载均衡。在ElasticSearch5.x 之后该节点称之为：coordinate 节点，其中还增加了一个叫：ingest 节点，用于预处理数据（索引和搜索阶段都可以用到），作为一般应用是不需要这个预处理节点做什么额外的预处理过程，这个节点和 client 节点之间可以看做是等同的，在代码中配置访问节点就都可以配置这些 ingest 节点即可。

建议集群中设置 3台以上的节点作为 master 节点【node.master: true node.data: false node.ingest:false】，这些节点只负责成为主节点，维护整个集群的状态。再根据数据量设置一批 data节点【node.master: false node.data: true node.ingest:false】，这些节点只负责存储数据，后期提供建立索引和查询索引的服务，这样的话如果用户请求比较频繁，这些节点的压力也会比较大，所以在集群中建议再设置一批 ingest 节点也称之为 client 节点【node.master: false node.data: false node.ingest:true】，这些节点只负责处理用户请求，实现请求转发，负载均衡等功能。
master节点：普通服务器即可(CPU 内存消耗一般)
data节点：主要消耗磁盘，内存
client | ingest 节点：普通服务器即可(如果要进行分组聚合操作的话，建议这个节点内存也分配多一点

    index.number_of_shards:5        #设置默认索引分片个数，默认为5片。
    index.number_of_replicas:1      #设置默认索引副本个数，默认为1个副本。
    path.conf:/path/to/conf         #设置配置文件的存储路径，默认是es根目录下的config文件夹。
    path.data:/path/to/data         #设置索引数据的存储路径，默认是es根目录下的data文件夹，可以设置多个存储路径，用逗号隔开，例：
    path.data:/path/to/data1,/path/to/data2
    path.work:/path/to/work         #设置临时文件的存储路径，默认是es根目录下的work文件夹。
    path.logs:/path/to/logs         #设置日志文件的存储路径，默认是es根目录下的logs文件夹
    path.plugins:/path/to/plugins   #设置插件的存放路径，默认是es根目录下的plugins文件夹
    bootstrap.mlockall:true         #设置为true来锁住内存。因为当jvm开始swapping时es的效率会降低，所以要保证它不swap，可以把ES_MIN_MEM和ES_MAX_MEM两个环境变量设置成同一个值，并且保证机器有足够的内存分配给es。同时也要允许elasticsearch的进程可以锁住内存，linux下可以通过`ulimit-lunlimited`命令。
    network.bind_host:192.168.0.1   #设置绑定的ip地址，可以是ipv4或ipv6的，默认为0.0.0.0。
    network.publish_host:192.168.0.1 #设置其它节点和该节点交互的ip地址，如果不设置它会自动判断，值必须是个真实的ip地址。
    network.host:192.168.0.1        #这个参数是用来同时设置bind_host和publish_host上面两个参数。
    transport.tcp.port:9300         #设置节点间交互的tcp端口，默认是9300。
    transport.tcp.compress:true     #设置是否压缩tcp传输时的数据，默认为false，不压缩。
    http.port:9200                  #设置对外服务的http端口，默认为9200。
    http.max_content_length:100mb   #设置内容的最大容量，默认100mb
    http.enabled:false              #是否使用http协议对外提供服务，默认为true，开启。
    gateway.type:local              #gateway的类型，默认为local即为本地文件系统，可以设置为本地文件系统，分布式文件系统，hadoop的HDFS，和amazon的s3服务器，其它文件系统的设置方法下次再详细说。
    gateway.recover_after_nodes:1   #设置集群中N个节点启动时进行数据恢复，默认为1。
    gateway.recover_after_time:5m   #设置初始化数据恢复进程的超时时间，默认是5分钟。
    gateway.expected_nodes:2        #设置这个集群中节点的数量，默认为2，一旦这N个节点启动，就会立即进行数据恢复。
    cluster.routing.allocation.node_initial_primaries_recoveries:4  #初始化数据恢复时，并发恢复线程的个数，默认为4。
    cluster.routing.allocation.node_concurrent_recoveries:2         #添加删除节点或负载均衡时并发恢复线程的个数，默认为4。
    indices.recovery.max_size_per_sec:0                             #设置数据恢复时限制的带宽，如入100mb，默认为0，即无限制。
    indices.recovery.concurrent_streams:5                           #设置这个参数来限制从其它分片恢复数据时最大同时打开并发流的个数，默认为5。
    discovery.zen.minimum_master_nodes:1                            #设置这个参数来保证集群中的节点可以知道其它N个有master资格的节点。默认为1，对于大的集群来说，可以设置大一点的值（2-4）
    discovery.zen.ping.timeout:3s                                   #设置集群中自动发现其它节点时ping连接超时时间，默认为3秒，对于比较差的网络环境可以高点的值来防止自动发现时出错。
    discovery.zen.ping.multicast.enabled:false                      #设置是否打开多播发现节点，默认是true。
    discovery.zen.ping.unicast.hosts:[“host1″,”host2:port”,”host3[portX-portY]”]    #设置集群中master节点的初始列表，可以通过这些节点来自动发现新加入集群的节点       
    node.max_local_storage_nodes: 2 # 多个节点可以在同一个安装路径启动
    bootstrap.memory_lock: false        #配置内存使用用交换分区
    http.cors.enabled: true         #增加新的参数，是否支持跨域，这样head插件可以访问es (5.x版本，如果没有可以自己手动加)
    http.cors.allow-origin: "*"     #增加新的参数，*表示支持所有域名，这样head插件可以访问es (5.x版本，如果没有可以自己手动加)

启动ElasticSearch

[root@localhost bin]# su - elasticsearch
[elasticsearch@localhost ~]$ cd /app/elasticsearch-6.5.1/bin/
[elasticsearch@localhost bin]$ ./elasticsearch -d

注意：启动的时候要以非root用户启动，否则会报错。如果非要用root启动，在启动的时候添加参数：./elasticsearch -Des.insecure.allow.root=true

日志查看

ElasticSearch的日志的名称是以集群名称命名的。

[elasticsearch@localhost bin]$ tail -500f ../logs/ES-Justin.log

验证

[root@localhost bin]# curl http://10.15.97.136:9200     节点信息
[root@localhost bin]# curl 'http://10.15.97.136:9200/_cluster/health?pretty'       集群的健康检查
{
  "cluster_name" : "ES-Justin",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 6,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
[root@localhost bin]# curl 'http://10.15.97.136:9200/_cluster/state?pretty'            集群的详细信息

ElasticSearch-head插件安装

[root@localhost app]# yum install -y npm
[root@localhost app]# git clone git://github.com/mobz/elasticsearch-head.git
[root@localhost app]# cd elasticsearch-head
[root@localhost elasticsearch-head]# npm install -g grunt --registry=https://registry.npm.taobao.org       安装grunt
[root@localhost elasticsearch-head]# npm install

在elasticsearch-head/node_modules/grunt下如果没有grunt二进制程序，需要执行：npm install grunt --save
修改 elasticsearch-head/Gruntfile.js文件

[root@localhost elasticsearch-head]# vim Gruntfile.js
connect: {
                        server: {
                                options: {
                                        hostname: '10.15.97.136',    #新增
                                        port: 9100,
                                        base: '.',
                                        keepalive: true
                                }
                        }
                }

修改 elasticsearch-head/_site/app.js 中http://localhost:9200字段到本机ES端口与IP

[root@localhost elasticsearch-head]# vim _site/app.js
init: function(parent) {
                        this._super();
                        this.prefs = services.Preferences.instance();
                        this.base_uri = this.config.base_uri || this.prefs.get("app-base_uri") || "http://10.15.97.136:9200";
                        if( this.base_uri.charAt( this.base_uri.length - 1 ) !== "/" ) {
                                // XHR request fails if the URL is not ending with a "/"
                                this.base_uri += "/";
                        }

启动head

[root@localhost elasticsearch-head]# cd /app/elasticsearch-head/node_modules/grunt/bin
[root@localhost bin]# ./grunt server &
[1] 17780
        [root@localhost bin]# Running "connect:server" (connect) task
        Waiting forever...
        Started connect web server on http://10.15.97.136:9100
[root@localhost bin]# netstat -antp |grep 9100
tcp        0      0 10.15.97.136:9100       0.0.0.0:*               LISTEN      17780/grunt         
[root@localhost bin]#

* ElasticSearch 性能优化

系统级别调优

去掉文件句柄限制

Linux中，每个进程默认打开的最大文件句柄数是1000，

[root@localhost bin]# vim /etc/security/limits.conf
elasticsearch hard nofile 65536
elasticsearch soft nofile 131072
elasticsearch hard nproc 4096
elasticsearch soft nproc 4096
* hard memlock unlimited
* soft memlock unlimited
[root@localhost ~]# vim /etc/pam.d/login
session    required     /lib/security/pam_limits.so
[root@localhost ~]# ulimit -l
64
[root@localhost ~]# reboot
[root@localhost ~]# ulimit -l
unlimited
[root@localhost ~]#

虚拟内存设置

max_map_count定义了进程能拥有的最多内存区域，默认启动ES会提示：
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

[root@localhost ~]# vim /etc/sysctl.conf
vm.max_map_count=655360
[root@localhost ~]# sysctl -p

尽量不用交换空间

如果不禁用交换空间，应控制操作系统尝试交换内存的积极性，可以尝试降低swappiness。该值控制操作系统尝试交换内存的积极性。这可以防止在正常情况下交换，但仍然允许操作系统在紧急内存情况下进行交换。

[root@localhost ~]# vim /etc/sysctl.conf
vm.swappiness = 1
[root@localhost ~]#

1的swappiness优于0，因为在某些内核版本上，swappiness为0可以调用OOM杀手。

ElasticSearch级别调优

ElasticSearch内存

[root@localhost ~]# vim /app/elasticsearch-6.5.1/config/jvm.options
-Xms2g
-Xmx2g
[root@localhost ~]#

ES内存建议采用分配机器物理内存的一半（Lucene利用底层操作系统来缓存内存中的数据结构，Lucene的性能依赖于与操作系统的这种交互，如果把所有可用的内存都给了Elasticsearch的堆，Lucene就不会有任何剩余的内存。会严重影响性能。），但最大不要超过32GB，如何判断内存设置是否恰当，看ES启动日志中的：

[elasticsearch@localhost bin]$ grep "compressed ordinary object pointers" ../logs/ES-Justin.log 
[2018-12-05T12:41:58,098][INFO ][o.e.e.NodeEnvironment    ] [ES-Justin-Salve2] heap size [1.9gb], compressed ordinary object pointers [true]
[elasticsearch@localhost bin]$

如果[true]，则表示ok。一半超过32GB会为[false]，请依次降低内存设置试试。

堆内存为什么不能超过物理机内存的一半？

堆对于Elasticsearch绝对重要，被许多内存数据结构用来提供快速操作，但还有另外一个非常重要的内存使用者Lucene。
Lucene旨在利用底层操作系统来缓存内存中的数据结构。 Lucene段(segment)存储在单个文件中。因为段是一成不变的，所以这些文件永远不会改变。这使得它们非常容易缓存，并且底层操作系统将愉快地将热段（hot segments）保留在内存中以便更快地访问。这些段包括倒排索引（用于全文搜索）和文档值（用于聚合）。
Lucene的性能依赖于与操作系统的这种交互。但是如果把所有可用的内存都给了Elasticsearch的堆，那么Lucene就不会有任何剩余的内存。这会严重影响性能。建议是将可用内存的50％提供给Elasticsearch堆，而将其他50％空闲。它不会被闲置; Lucene会高兴地吞噬掉剩下的东西。

堆内存为什么不能超过32GB

在Java中，所有对象都分配在堆上并由指针引用。普通的对象指针（OOP）指向这些对象，传统上它们是CPU本地字的大小：32位或64位，取决于处理器。
对于32位系统，这意味着最大堆大小为4 GB。对于64位系统，堆大小可能会变得更大，但是64位指针的开销意味着仅仅因为指针较大而存在更多的浪费空间。并且比浪费的空间更糟糕，当在主存储器和各种缓存（LLC，L1等等）之间移动值时，较大的指针消耗更多的带宽。
Java使用称为压缩oops的技巧来解决这个问题。而不是指向内存中的确切字节位置，指针引用对象偏移量。这意味着一个32位指针可以引用40亿个对象，而不是40亿个字节。最终，这意味着堆可以增长到约32 GB的物理尺寸，同时仍然使用32位指针。
一旦你穿越了这个神奇的〜32 GB的边界，指针就会切换回普通的对象指针。每个指针的大小增加，使用更多的CPU内存带宽，并且实际上会丢失内存。实际上，在使用压缩oops获得32 GB以下堆的相同有效内存之前，需要大约40-50 GB的分配堆。
小结：即使你有足够的内存空间，尽量避免跨越32GB的堆边界；否则会导致浪费了内存，降低了CPU的性能，并使GC在大堆中挣扎。

堆内存优化建议
方式一：最好的办法是在系统上完全禁用交。
方式二：控制操作系统尝试交换内存的积极性。
方式三：mlockall允许JVM锁定其内存并防止其被操作系统交换。

给ES分配的内存有一个魔法上限值26GB，这样可以确保启用zero based Compressed Oops，这样性能才是最佳的。

*FQ

Starting elasticsearch: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000085330000, 2060255232, 0) failed; error='Cannot allocate memory' (errno=12)
        #
        # There is insufficient memory for the Java Runtime Environment to continue.
        # Native memory allocation (mmap) failed to map 2060255232 bytes for committing reserved memory.
        # An error report file with more information is saved as:
        # /tmp/hs_err_pid2616.log

默认使用的内存大小为2G，虚拟机没有那么多的空间
解决方法：vim /etc/elasticsearch/jvm.options
-Xms512m
-Xmx512m

[2018-12-5T19:19:01,641][INFO ][o.e.b.BootstrapChecks    ] [elk-1] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2018-12-5T19:19:01,658][ERROR][o.e.b.Bootstrap          ] [elk-1] node validation exception
        [1] bootstrap checks failed
        [1]: system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk

解决方法： vim /etc/elasticsearch/elasticsearch.yml
bootstrap.system_call_filter: false

[2018-12-06T19:19:01,641][INFO ][o.e.b.BootstrapChecks    ] [elk-1] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2018-12-06T19:19:01,658][ERROR][o.e.b.Bootstrap          ] [elk-1] node validation exception
[1] bootstrap checks failed
[1]: system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk

解决：修改配置文件，在配置文件添加一项参数（目前还没明白此参数的作用）
vim /etc/elasticsearch/elasticsearch.yml
bootstrap.system_call_filter: false