Secondary NameNode usage and High availability in Hadoop 2.x

Question

Can you please help me with the scenarios below?

1) While using Hadoop V2, do we use Secondary NameNode in production environment?

2) For Hadoop V2, suppose we use multiple NameNodes in an active/passive configuration for High Availability, and the edits log file grows huge:

How does the edits log get applied to the fsimage? And wouldn't applying a huge edits log be time consuming during NameNode startup? (We had the Secondary NameNode in Hadoop V1 to solve this problem.)

Answer 1:

Answers to your queries:

1) While using Hadoop V2, do we use Secondary NameNode in production environment?

A Secondary NameNode is not required in a production environment if you deploy a Standby NameNode for NameNode High Availability. In an HA cluster the Standby NameNode also performs the checkpoints.

2) How does the edits log get applied to the fsimage in the absence of a Secondary NameNode?

To answer this, you need to understand that high availability has been implemented in Hadoop in two different ways: High Availability with QJM (Quorum Journal Manager) and High Availability with NFS shared storage.

Of these two approaches, QJM is preferred.

In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).

When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node reads these edits from the JNs and applies them to its own namespace.

In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
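To make the "majority of JNs" rule concrete, here is a minimal sketch in Python. It is only a toy model, not Hadoop code: the `JournalNode` class and the `log_edit` function are hypothetical names invented for illustration, and the real protocol runs over RPC with much more bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class JournalNode:
    """Toy stand-in for a JournalNode: keeps edits in memory (hypothetical)."""
    name: str
    up: bool = True
    edits: list = field(default_factory=list)

    def write(self, txid: int, op: str) -> bool:
        if not self.up:
            return False  # an unreachable node never acknowledges
        self.edits.append((txid, op))
        return True


def log_edit(journal_nodes, txid, op):
    """Return True only if a strict majority of nodes acknowledged the edit."""
    acks = sum(1 for jn in journal_nodes if jn.write(txid, op))
    return acks > len(journal_nodes) // 2


jns = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3")]
ok_all_up = log_edit(jns, 1, "mkdir /data")

jns[2].up = False  # one of three nodes fails: 2 of 3 is still a majority
ok_one_down = log_edit(jns, 2, "create /data/a.txt")

jns[1].up = False  # two of three nodes fail: no majority left
ok_two_down = log_edit(jns, 3, "delete /data/a.txt")

print(ok_all_up, ok_one_down, ok_two_down)  # True True False
```

With three JournalNodes the write still succeeds when one node is down, which is why an odd number of JNs is the usual choice.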

It is vital for an HA cluster that only one of the NameNodes be Active at a time; otherwise the namespace state would diverge between the two (the split-brain scenario). The JournalNodes enforce this by allowing only one NameNode to write at a time, and ZooKeeper is used to coordinate automatic failover between the NameNodes.

I have explained the NameNode failover process in detail in my other Stack Overflow question: How does Hadoop Namenode failover process works?

Answer 2:
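One common way to keep a stale Active from writing after a failover is an epoch number that increases with every new writer. The sketch below is a simplified model of that idea, assuming a single in-memory journal; `FencedJournal`, `new_epoch`, and `write` are hypothetical names, and the actual QJM protocol is more involved.

```python
class FencedJournal:
    """Toy journal that accepts writes only from the newest writer epoch."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch: int) -> bool:
        # A NameNode becoming Active must present a strictly higher epoch.
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def write(self, epoch: int, txid: int, op: str) -> bool:
        # Writes carrying an older epoch (a stale Active) are rejected.
        if epoch < self.promised_epoch:
            return False
        self.edits.append((txid, op))
        return True


journal = FencedJournal()
journal.new_epoch(1)                       # nn1 becomes Active
first = journal.write(1, 1, "mkdir /a")    # accepted
journal.new_epoch(2)                       # failover: nn2 takes over
stale = journal.write(1, 2, "mkdir /b")    # old Active is fenced out
fresh = journal.write(2, 2, "mkdir /c")    # new Active may write
print(first, stale, fresh)  # True False True
```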

1) While using Hadoop V2, do we use Secondary NameNode in production environment?

It depends entirely on how your production environment is set up. If you are using Hadoop V2 with HA, you don't need a Secondary NameNode in production, because the Standby NameNode performs the same tasks as the Secondary NameNode. If your production setup does not use NameNode HA, then you have to use a Secondary NameNode for checkpointing. Refer to "Understanding Hadoop 2.x Architecture and it's Demons" for more information on this.

2) For Hadoop V2, suppose we use multiple NameNodes in an active/passive configuration for High Availability, and the edits log file grows huge:
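To illustrate what checkpointing buys you at startup, here is a rough Python sketch. It models the namespace as a set of paths and the edit log as `(txid, op, path)` tuples; these shapes and the functions `apply_edits` and `checkpoint` are assumptions made for the example, not the real fsimage or edits formats.

```python
def apply_edits(namespace, edits, after_txid=0):
    """Replay edits with txid > after_txid onto a copy of the namespace."""
    result = set(namespace)
    last = after_txid
    for txid, op, path in edits:
        if txid <= after_txid:
            continue  # already folded into the image
        if op == "create":
            result.add(path)
        elif op == "delete":
            result.discard(path)
        last = txid
    return result, last


def checkpoint(fsimage, fsimage_txid, edits):
    """Merge pending edits into a new image; return (image, last txid)."""
    return apply_edits(fsimage, edits, fsimage_txid)


edits = [
    (1, "create", "/a"),
    (2, "create", "/b"),
    (3, "delete", "/a"),
    (4, "create", "/c"),
]

# Checkpoint taken when only the first three edits existed.
image, image_txid = checkpoint(set(), 0, edits[:3])

# On restart, only edits newer than the checkpoint need replaying.
to_replay = [e for e in edits if e[0] > image_txid]
namespace, last_txid = apply_edits(image, edits, image_txid)
print(sorted(namespace), last_txid, len(to_replay))  # ['/b', '/c'] 4 1
```

The more recent the checkpoint, the shorter the tail of edits that has to be replayed when a NameNode starts.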

As I understand it, your main concern here is "how are edit logs managed with NameNode HA in Hadoop V2?"

Here is the answer: edit log management can be done with the Quorum Journal Manager (QJM) or with NFS shared storage.

With QJM, a group of daemons called JournalNodes (JNs) communicates with the Active NameNode. The Active NameNode writes every namespace update to this group, which maintains the edit log. The Standby NameNode constantly reads the edit log updates from the JNs and applies them to keep its own state up to date.

With NFS shared storage, both the Active NameNode and the Standby NameNode have access to a particular directory on shared storage (i.e. a Network File System). Whenever the Active NameNode makes an update, it logs the event to the shared directory. The Standby NameNode watches the same shared directory for updates and applies them to its own edit log as they appear.
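In both setups the Standby behaves like a reader that remembers the last transaction it applied and only picks up newer ones. A minimal sketch of that tailing loop, assuming the shared edit log is just a Python list of `(txid, op, path)` tuples (the `StandbyTailer` class is a hypothetical name for illustration):

```python
class StandbyTailer:
    """Toy Standby that applies only edits it has not seen yet."""

    def __init__(self):
        self.last_applied_txid = 0
        self.namespace = set()

    def tail(self, shared_edits):
        """Apply unseen edits from the shared log; return how many were applied."""
        applied = 0
        for txid, op, path in shared_edits:
            if txid <= self.last_applied_txid:
                continue  # already applied in an earlier pass
            if op == "create":
                self.namespace.add(path)
            elif op == "delete":
                self.namespace.discard(path)
            self.last_applied_txid = txid
            applied += 1
        return applied


shared_edits = [(1, "create", "/a")]
standby = StandbyTailer()
n1 = standby.tail(shared_edits)  # picks up txid 1

# The Active appends two more edits to the shared log.
shared_edits += [(2, "create", "/b"), (3, "delete", "/a")]
n2 = standby.tail(shared_edits)  # picks up txids 2 and 3
n3 = standby.tail(shared_edits)  # nothing new
print(n1, n2, n3, sorted(standby.namespace))  # 1 2 0 ['/b']
```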

I hope this helps...

Source: https://stackoverflow.com/questions/33494697/secondary-namenode-usage-and-high-availability-in-hadoop-2-x

Tags

Hadoop

HDFS

hadoop2