Hadoop cluster setup:
Configure /etc/hosts (identical on all 4 nodes)
192.168.83.11 hd1
192.168.83.22 hd2
192.168.83.33 hd3
192.168.83.44 hd4
Configure the hostname (takes effect after reboot)
[hadoop@hd1 ~]$ more /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hd1
Configure the user and group
[hadoop@hd1 ~]$ id hadoop
uid=1001(hadoop) gid=10010(hadoop) groups=10010(hadoop)
Configure the JDK
[hadoop@hd1 ~]$ env|grep JAVA
JAVA_HOME=/usr/java/jdk1.8.0_11
[hadoop@hd1 ~]$ java -version
java version "1.8.0_11"
Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
Configure passwordless SSH login
ssh-keygen -t rsa
ssh-keygen -t dsa
cat ~/.ssh/*.pub > ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@hd2:~/.ssh/authorized_keys
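For reference, a minimal sketch of generating a key on hd1 and pushing it to the other nodes (assumptions: the hadoop user and an existing ~/.ssh directory on every target; for a full mesh, each node's public key must first be appended to the same authorized_keys file):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys
for h in hd2 hd3 hd4; do
  scp ~/.ssh/authorized_keys hadoop@$h:~/.ssh/
done
Afterwards, "ssh hd2 hostname" should work without a password prompt.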
Configure environment variables (in ~/.bash_profile):
export JAVA_HOME=/usr/java/jdk1.8.0_11
export JRE_HOME=/usr/java/jdk1.8.0_11/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export HADOOP_INSTALL=/home/hadoop/hadoop-2.7.1
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export PATH=$PATH:/home/hadoop/zookeeper-3.4.6/bin
Note: Hadoop has a small quirk: setting JAVA_HOME only in ~/.bash_profile does not take effect for the Hadoop daemons; JAVA_HOME must also be set in hadoop-env.sh:
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/jdk1.8.0_11
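After editing ~/.bash_profile, reload it so the variables take effect in the current shell; a quick hedged check:
source ~/.bash_profile
echo $JAVA_HOME $HADOOP_INSTALL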
Software preparation:
[hadoop@hd1 software]$ ls -l
total 931700
-rw-r--r-- 1 hadoop hadoop 17699306 Oct 6 17:30 zookeeper-3.4.6.tar.gz
-rw-r--r--. 1 hadoop hadoop 336424960 Jul 18 23:13 hadoop-2.7.1.tar
tar -xvf /usr/hadoop/hadoop-2.7.1.tar -C /home/hadoop/
tar -xvf ../software/zookeeper-3.4.6.tar.gz -C /home/hadoop/
Hadoop HA configuration
Reference: http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Goal: "Using the Quorum Journal Manager or Conventional Shared Storage". The Quorum Journal Manager (a quorum of JournalNodes, JNs) exists precisely to solve the shared-storage problem. The official docs also offer an NFS-based shared-storage option, but I have doubts about NFS performance and prefer not to use it.
Architecture
The official documentation describes the architecture in detail but is fairly heavy going; the following summary is adapted from yameing's CSDN blog (full text: https://blog.csdn.net/yameing/article/details/39696151?utm_source=copy).
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time only one is active while the other is standby. The active NN handles all client communication; the standby NN acts as a simple slave, maintaining just enough state to allow a fast failover when needed.
To keep the standby NN synchronized with the active NN, both nodes communicate with a group of separate JournalNode (JN) processes. Any edit performed on the active NN is durably logged to a majority of the JNs. The standby NN reads those edits from the JNs and constantly watches for new entries; as it sees them, it applies them to its own namespace, keeping the two NNs in sync. During a failover, the standby NN makes sure it has read all edits from the JNs before promoting itself to active, which guarantees the namespace is fully synchronized before the failover completes.
For a fast failover it is also important that the standby NN has up-to-date block location information. To achieve this, the DataNodes are configured with the addresses of both NNs and send block reports and heartbeats to both.
It is vital that only one NN is active at a time; otherwise the two NNs' states would quickly diverge, risking data loss or other incorrect results. To guarantee this property and prevent the so-called split-brain scenario, the JNs only ever allow a single NN to write edits at a time. During a failover, the NN that is to become active simply takes over the role of writing to the JNs, which effectively prevents the other NN from continuing as active and lets the new active node proceed with the failover safely.
Node and instance planning:
NameNode machines: the hardware for the active and standby NNs should be equivalent, the same as for a non-HA cluster.
JournalNode machines: the JN daemons are relatively lightweight, so they can reasonably be collocated with other Hadoop daemons such as the NNs, the JobTracker, or the ResourceManager. Note: there must be at least 3 JN daemons, because edits have to be written to a majority of the JNs; this lets the system tolerate the failure of a single machine. You may run more than 3 JNs, but to actually increase the number of failures the system can tolerate, run an odd number. With N JNs, the system keeps working while tolerating at most (N-1)/2 failed machines.
Note: in an HA cluster the standby NN also performs checkpoints of the namespace, so it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode; doing so is in fact an error. This also allows the hardware previously dedicated to the Secondary NameNode to be reused when reconfiguring a non-HA HDFS cluster as HA.
Configuration details
To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.
- dfs.nameservices - the logical name for this new nameservice
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
- dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
- dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each NameNode to listen on
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
- dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
- dfs.namenode.shared.edits.dir - the URI which identifies the group of JNs where the NameNodes will write/read edits
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
- dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
- dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover.
Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario.
sshfence - SSH to the Active NameNode and kill the process
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
- fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given. Set it in your core-site.xml file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
- dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
Configure the DataNodes (slaves file):
[hadoop@hd1 hadoop]$ more slaves
hd2
hd3
hd4
Once the above configuration is complete, copy the configuration files to the other nodes to finish the Hadoop cluster configuration.
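A hedged sketch of syncing the configuration to the other nodes (the install path follows the HADOOP_INSTALL value set earlier; adjust if your layout differs):
for h in hd2 hd3 hd4; do
  scp /home/hadoop/hadoop-2.7.1/etc/hadoop/{core-site.xml,hdfs-site.xml,slaves,hadoop-env.sh} hadoop@$h:/home/hadoop/hadoop-2.7.1/etc/hadoop/
done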
Startup:
- Start the JNs by running the command "hadoop-daemon.sh start journalnode".
- For a brand-new cluster, first format a NameNode by running "hdfs namenode -format" on one of the NN nodes.
- If a NN has already been formatted, copy its metadata to the other NN node by running "hdfs namenode -bootstrapStandby" on the unformatted NN node. (Note: before copying the metadata, start the already-formatted NN, and only that one node, with "hadoop-daemon.sh start namenode".)
- If converting a non-HA cluster to HA, run "hdfs namenode -initializeSharedEdits".
- Formatting the NN with "hdfs namenode -format" reported an error:
18/10/07 21:42:34 INFO ipc.Client: Retrying connect to server: hd2/192.168.83.22:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/07 21:42:34 WARN namenode.NameNode: Encountered exception during format:
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Unable to check if JNs are ready for formatting. 1 exceptions thrown:
192.168.83.22:8485: Call From hd1/192.168.83.11 to hd2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.hasSomeData(QuorumJournalManager.java:232)
at org.apache.hadoop.hdfs.server.common.Storage.confirmFormat(Storage.java:900)
at org.apache.hadoop.hdfs.server.namenode.FSImage.confirmFormat(FSImage.java:184)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:987)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1429)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
18/10/07 21:42:34 INFO ipc.Client: Retrying connect to server: hd3/192.168.83.33:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Solution: formatting the NN requires connecting to the JNs, and this error appears when the connection to a JN fails or times out. First check whether the JNs are running; if the failure is only due to network latency, the error can be avoided by increasing the IPC retry settings as below.
<!-- Adjust the ipc parameters in core-site.xml to avoid ConnectException when connecting to the JournalNode service -->
<property>
<name>ipc.client.connect.max.retries</name>
<value>100</value>
<description>Indicates the number of retries a client will make to establish a server connection.</description>
</property>
<property>
<name>ipc.client.connect.retry.interval</name>
<value>10000</value>
<description>Indicates the number of milliseconds a client will wait for before retrying to establish a server connection.</description>
</property>
---------------------
The snippet above is from 锐湃's CSDN blog; full text: https://blog.csdn.net/chuyouyinghe/article/details/78976933?utm_source=copy
Notes:
1) Tuning the ipc parameters in core-site.xml only helps when the connection timeout is caused by the target service not having finished starting up. If the target service itself failed to start, adjusting the ipc parameters has no effect.
2) With this configuration the NameNode will keep trying to connect to the JournalNodes for up to 1000 s (maxRetries=100, sleepTime=10000 ms). If the cluster is very large or the network is unstable and connections take longer than 1000 s, the NameNode will still shut down.
Automatic Failover:
The sections above describe manual failover. In that mode the system will not trigger a failover automatically, even if the active NN has failed. The following describes how to configure and deploy automatic failover.
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
- Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
- Active NameNode election - ZooKeeper provides a simple mechanism to exclusively select a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
- Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
- ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.
- ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
Deploying ZooKeeper
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
Install ZooKeeper:
- Deploy ZK in cluster (replicated) mode
Reference: https://zookeeper.apache.org/doc/r3.4.6/zookeeperStarted.html
Extract zookeeper-3.4.6.tar.gz
vi conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/hadoop/zookeeper-3.4.6/tmp
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=hd1:2888:3888
server.2=hd2:2888:3888
server.3=hd3:2888:3888
- Configure the server ID (myid)
The entries of the form server.X list the servers that make up the ZooKeeper service. When the server starts up, it knows which server it is by looking for the file myid in the data directory. That file contains the server number.
server.1, server.2 and server.3 above are the ZooKeeper server IDs.
mkdir -p /home/hadoop/zookeeper-3.4.6/tmp
vi /home/hadoop/zookeeper-3.4.6/tmp/myid
[hadoop@hd1 tmp]$ more myid
1
- On hd1, write 1 into /home/hadoop/zookeeper-3.4.6/tmp/myid; write 2 on hd2, 3 on hd3, and so on.
- This completes the ZK configuration; copy the configuration files to the other ZK nodes to finish the ZK cluster setup (sketch below).
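A hedged sketch of pushing zoo.cfg to the other ZK nodes and writing each node's own myid (paths taken from the zoo.cfg above):
for h in hd2 hd3; do
  scp /home/hadoop/zookeeper-3.4.6/conf/zoo.cfg hadoop@$h:/home/hadoop/zookeeper-3.4.6/conf/
done
ssh hadoop@hd2 "mkdir -p /home/hadoop/zookeeper-3.4.6/tmp && echo 2 > /home/hadoop/zookeeper-3.4.6/tmp/myid"
ssh hadoop@hd3 "mkdir -p /home/hadoop/zookeeper-3.4.6/tmp && echo 3 > /home/hadoop/zookeeper-3.4.6/tmp/myid"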
Start ZK:
[hadoop@hd1 bin]$ sh zkServer.sh start
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[hadoop@hd1 bin]$ jps
1957 QuorumPeerMain
1976 Jps
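Once ZK is running on all three nodes, quorum health can be checked on each node; zkServer.sh status reports whether the local server is a follower or the leader:
[hadoop@hd1 bin]$ sh zkServer.sh status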
Configuring automatic failover:
- In your hdfs-site.xml file, add:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
- In your core-site.xml file, add:
<property>
<name>ha.zookeeper.quorum</name>
<value>hd1:2181,hd2:2181,hd3:2181</value>
</property>
This lists the host-port pairs running the ZooKeeper service.
Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
hdfs zkfc -formatZK
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
Starting the cluster with “start-dfs.sh”
Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
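If the daemons are managed by hand rather than with start-dfs.sh, the ZKFC can also be started manually on each NameNode host; a hedged example using the HADOOP_INSTALL variable defined earlier:
$HADOOP_INSTALL/sbin/hadoop-daemon.sh --script $HADOOP_INSTALL/bin/hdfs start zkfc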
HA startup summary
- Start the JNs (hd2, hd3, hd4):
[hadoop@hd4 ~]$ hadoop-daemon.sh start journalnode
starting journalnode, logging to /usr/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-journalnode-hd4.out
[hadoop@hd4 ~]$ jps
1843 JournalNode
1879 Jps
- Format the NN (run on either NN node, hd1 or hd2):
[hadoop@hd1 sbin]$ hdfs namenode -format
18/10/07 05:54:30 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hd1/192.168.83.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.1
........
18/10/07 05:54:34 INFO namenode.FSImage: Allocated new BlockPoolId: BP-841723191-192.168.83.11-1538862874971
18/10/07 05:54:34 INFO common.Storage: Storage directory /usr/hadoop/hadoop-2.7.1/dfs/name has been successfully formatted.
18/10/07 05:54:35 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/10/07 05:54:35 INFO util.ExitUtil: Exiting with status 0
18/10/07 05:54:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hd1/192.168.83.11
************************************************************/
- Copy the NN metadata (note: before copying the metadata, start the already-formatted NN first, and only that one node)
- Start the formatted NN:
[hadoop@hd1 current]$ hadoop-daemon.sh start namenode
starting namenode, logging to /usr/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-namenode-hd1.out
[hadoop@hd1 current]$ jps
1777 QuorumPeerMain
2177 Jps
- Run the metadata copy command on the unformatted NN node (hd2):
[hadoop@hd2 ~]$ hdfs namenode -bootstrapStandby
18/10/07 06:07:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-bootstrapStandby]
STARTUP_MSG: version = 2.7.1
。。。。。。。。。。。。。。。。。。。。。。
************************************************************/
18/10/07 06:07:15 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/10/07 06:07:15 INFO namenode.NameNode: createNameNode [-bootstrapStandby]
18/10/07 06:07:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
=====================================================
About to bootstrap Standby ID nn2 from:
Nameservice ID: mycluster
Other Namenode ID: nn1
Other NN's HTTP address: http://hd1:50070
Other NN's IPC address: hd1/192.168.83.11:8020
Namespace ID: 1626081692
Block pool ID: BP-841723191-192.168.83.11-1538862874971
Cluster ID: CID-230e9e54-e6d1-4baf-a66a-39cc69368ed8
Layout version: -63
isUpgradeFinalized: true
=====================================================
18/10/07 06:07:17 INFO common.Storage: Storage directory /usr/hadoop/hadoop-2.7.1/dfs/name has been successfully formatted.
18/10/07 06:07:18 INFO namenode.TransferFsImage: Opening connection to http://hd1:50070/imagetransfer?getimage=1&txid=0&storageInfo=-63:1626081692:0:CID-230e9e54-e6d1-4baf-a66a-39cc69368ed8
18/10/07 06:07:18 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
18/10/07 06:07:18 INFO namenode.TransferFsImage: Transfer took 0.01s at 0.00 KB/s
18/10/07 06:07:18 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000000 size 353 bytes.
18/10/07 06:07:18 INFO util.ExitUtil: Exiting with status 0
18/10/07 06:07:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/
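After the bootstrap completes, the NN on hd2 still has to be started (start-dfs.sh will also take care of this later); a hedged example:
[hadoop@hd2 ~]$ hadoop-daemon.sh start namenode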
- Initialize the HA state in ZK:
hdfs zkfc -formatZK
Formatting the ZK state reported an error:
18/10/07 22:34:06 INFO zookeeper.ClientCnxn: Opening socket connection to server hd1/192.168.83.11:2181. Will not attempt to authenticate using SASL (unknown error)
18/10/07 22:34:06 INFO zookeeper.ClientCnxn: Socket connection established to hd1/192.168.83.11:2181, initiating session
18/10/07 22:34:06 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: Opening socket connection to server hd2/192.168.83.22:2181. Will not attempt to authenticate using SASL (unknown error)
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: Socket connection established to hd2/192.168.83.22:2181, initiating session
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
18/10/07 22:34:07 ERROR ha.ActiveStandbyElector: Connection timed out: couldn't connect to ZooKeeper in 5000 milliseconds
18/10/07 22:34:07 INFO zookeeper.ZooKeeper: Session: 0x0 closed
18/10/07 22:34:07 INFO zookeeper.ClientCnxn: EventThread shut down
18/10/07 22:34:07 FATAL ha.ZKFailoverController: Unable to start failover controller. Unable to connect to ZooKeeper quorum at hd1:2181,hd2:2181,hd3:2181. Please check the configured value for ha.zookeeper.quorum and ensure that ZooKeeper is running.
[hadoop@hd1 ~]$
Looks like your zookeeper quorum was not able to elect a master. Maybe you have misconfigured your zookeeper?
Make sure that you have entered all 3 servers in your zoo.cfg with a unique ID. Make sure you have the same config on all 3 of your machines, and make sure that every server is using the correct myid as specified in the cfg.
After fixing the ZK configuration, re-run:
[hadoop@hd1 bin]$ hdfs zkfc -formatZK
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/hadoop/hadoop-2.7.1/lib/native
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-504.el6.x86_64
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:user.name=hadoop
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop/zookeeper-3.4.6/bin
18/10/09 20:27:21 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hd1:2181,hd2:2181,hd3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5119fb47
18/10/09 20:27:21 INFO zookeeper.ClientCnxn: Opening socket connection to server hd1/192.168.83.11:2181. Will not attempt to authenticate using SASL (unknown error)
18/10/09 20:27:22 INFO zookeeper.ClientCnxn: Socket connection established to hd1/192.168.83.11:2181, initiating session
18/10/09 20:27:22 INFO zookeeper.ClientCnxn: Session establishment complete on server hd1/192.168.83.11:2181, sessionid = 0x16658c662c80000, negotiated timeout = 5000
18/10/09 20:27:22 INFO ha.ActiveStandbyElector: Session connected.
18/10/09 20:27:22 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/mycluster in ZK.
18/10/09 20:27:22 INFO zookeeper.ZooKeeper: Session: 0x16658c662c80000 closed
18/10/09 20:27:22 INFO zookeeper.ClientCnxn: EventThread shut down
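As an optional check, the znode created by formatZK can be inspected with the ZooKeeper CLI (a hedged example; /hadoop-ha/mycluster is the path reported in the log above):
[hadoop@hd1 bin]$ ./zkCli.sh -server hd1:2181
# then, at the zkCli prompt:
ls /hadoop-ha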
Start ZK:
[hadoop@hd1 bin]$ ./zkServer.sh start
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Start everything:
[hadoop@hd1 bin]$ start-dfs.sh
18/10/09 20:36:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hd1 hd2]
hd2: namenode running as process 2065. Stop it first.
hd1: namenode running as process 2011. Stop it first.
hd2: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-hd2.out
hd4: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-hd4.out
hd3: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-hd3.out
Starting journal nodes [hd2 hd3 hd4]
hd4: journalnode running as process 1724. Stop it first.
hd2: journalnode running as process 1839. Stop it first.
hd3: journalnode running as process 1725. Stop it first.
18/10/09 20:37:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting ZK Failover Controllers on NN hosts [hd1 hd2]
hd1: zkfc running as process 3045. Stop it first.
hd2: zkfc running as process 2601. Stop it first.
Check the hd2 DataNode log:
[hadoop@hd2 logs]$ jps
1984 QuorumPeerMain
2960 Jps
2065 NameNode
2601 DFSZKFailoverController
1839 JournalNode
2018-10-09 20:37:07,674 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/hadoop-2.7.1/dfs/data/in_use.lock acquired by nodename 2787@hd2
2018-10-09 20:37:07,674 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/hadoop/hadoop-2.7.1/dfs/data: namenode clusterID = CID-e28f1182-d452-4f23-9b37-9a59d4bdeaa0; datanode clusterID = CID-876d5634-38e8-464c-be02-714ee8c72878
2018-10-09 20:37:07,675 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to hd2/192.168.83.22:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1361)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1326)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:316)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:801)
at java.lang.Thread.run(Thread.java:745)
2018-10-09 20:37:07,676 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to hd1/192.168.83.11:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1361)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1326)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:316)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:801)
at java.lang.Thread.run(Thread.java:745)
2018-10-09 20:37:07,683 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to hd1/192.168.83.11:8020
2018-10-09 20:37:07,684 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to hd2/192.168.83.22:8020
2018-10-09 20:37:07,687 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2018-10-09 20:37:09,688 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2018-10-09 20:37:09,689 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2018-10-09 20:37:09,698 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hd2/192.168.83.22
************************************************************/
The hd2 DN failed to start. The log shows namenode clusterID = CID-e28f1182-d452-4f23-9b37-9a59d4bdeaa0 while datanode clusterID = CID-876d5634-38e8-464c-be02-714ee8c72878; the mismatch between the NN and DN cluster IDs caused the startup failure. Looking back at my own steps: the NN was formatted more than once, which changed the NN clusterID, while the DN clusterID stayed the same. The fix is simple: delete the DN data directory and restart the DN (a sketch follows).
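A hedged sketch of the cleanup on hd2 (the data directory path is taken from the log above; deleting it discards that DataNode's blocks, which is acceptable only on a fresh cluster like this one):
[hadoop@hd2 ~]$ hadoop-daemon.sh stop datanode
[hadoop@hd2 ~]$ rm -rf /home/hadoop/hadoop-2.7.1/dfs/data
[hadoop@hd2 ~]$ hadoop-daemon.sh start datanode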
Check the hd2 node again:
[hadoop@hd2 dfs]$ jps
1984 QuorumPeerMain
2065 NameNode
3123 DataNode
3268 Jps
2601 DFSZKFailoverController
1839 JournalNode
At this point every instance on every node is up; here is the full list:
hd1:
[hadoop@hd1 bin]$ jps
4180 Jps
3045 DFSZKFailoverController
2135 QuorumPeerMain
2011 NameNode
hd2:
[hadoop@hd2 dfs]$ jps
1984 QuorumPeerMain
2065 NameNode
3123 DataNode
3268 Jps
2601 DFSZKFailoverController
1839 JournalNode
hd3:
[hadoop@hd3 bin]$ jps
2631 Jps
2523 DataNode
1725 JournalNode
1807 QuorumPeerMain
hd4:
[hadoop@hd4 ~]$ jps
2311 DataNode
2425 Jps
1724 JournalNode
Access the NN web UI (either NN):
http://192.168.83.11:50070
http://192.168.83.22:50070
[hadoop@hd1 bin]$ hdfs dfs -put zookeeper.out /
18/10/09 21:11:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hd1 bin]$ hdfs dfs -ls /
18/10/09 21:11:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 3 hadoop supergroup 25698 2018-10-09 21:11 /zookeeper.out
Next, configure MapReduce
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hd1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
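Note that start-dfs.sh does not start the YARN daemons. After copying mapred-site.xml and yarn-site.xml to all nodes, they can be brought up from the ResourceManager host; a hedged example (jps should then show ResourceManager on hd1 and NodeManager on the slave nodes):
[hadoop@hd1 ~]$ start-yarn.sh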
Manage MR via the web UI:
http://hd1:8088/
Default service ports can be looked up in the configuration section at http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html.
Manual NN administration:
[hadoop@hd1 bin]$ hdfs haadmin
Usage: haadmin
[-transitionToActive [--forceactive] <serviceId>]
[-transitionToStandby <serviceId>] -- serviceId is nn1/nn2 as defined earlier
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run “hdfs haadmin -help <command>”.
- transitionToActive and transitionToStandby - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. These commands do not attempt to perform any fencing, and thus should rarely be used. Instead, one should almost always prefer to use the “hdfs haadmin -failover” subcommand.
- failover - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.
- getServiceState - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing either “standby” or “active” to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby.
- checkHealth - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.
Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.
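A few hedged examples using the nn1/nn2 service IDs defined in hdfs-site.xml earlier:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2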
Source: oschina
Link: https://my.oschina.net/u/3862440/blog/2223406