Mastering Hadoop 3

HDFS high availability in Hadoop 3.x

Hadoop 2.0 introduced active and standby NameNodes. At any point, one of the two NameNodes is in the active state and the other is in the standby state. The active NameNode is responsible for serving all client requests in the cluster. The standby NameNode keeps its state in sync with the active NameNode so that it can take over quickly in the event of a failure. However, what if one of the two NameNodes fails? In that case, only a single NameNode remains and the cluster is no longer highly available. This means that the NameNode layer can tolerate only one failure. This behavior is at odds with the core fault-tolerant design of Hadoop, which can accommodate the failure of more than one DataNode in a cluster. With that in mind, support for more than one standby NameNode was introduced in Hadoop 3. An additional standby NameNode behaves the same as any other standby NameNode. Each one has its own ID, RPC address, and HTTP address, and each uses the Quorum Journal Manager (QJM) to get the latest edit logs and update its fsimage.

The following are the core configurations that are required for NameNode HA:

  • First, you need to define the nameservice for the cluster:
      <property>
         <name>dfs.nameservices</name>
         <value>mycluster</value>
       </property>
  • Then, you have to list the IDs of all the NameNodes in the nameservice, mycluster, which we defined previously:
       <property>
         <name>dfs.ha.namenodes.mycluster</name>
         <value>nn1,nn2,nn3</value>
       </property>
  • After giving the identifiers to the NameNodes, you need to add RPC and HTTP addresses for those NameNodes. Here, we will define RPC and HTTP addresses for nn1, nn2, and nn3:
        <property>
          <name>dfs.namenode.rpc-address.mycluster.nn1</name>
          <value>masternode1.example.com:9820</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.mycluster.nn2</name>
          <value>masternode2.example.com:9820</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.mycluster.nn3</name>
          <value>masternode3.example.com:9820</value>
        </property>

        <property>
          <name>dfs.namenode.http-address.mycluster.nn1</name>
          <value>masternode1.example.com:9870</value>
        </property>
        <property>
          <name>dfs.namenode.http-address.mycluster.nn2</name>
          <value>masternode2.example.com:9870</value>
        </property>
        <property>
          <name>dfs.namenode.http-address.mycluster.nn3</name>
          <value>masternode3.example.com:9870</value>
        </property>
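
Since the standby NameNodes rely on QJM to read edit logs, a setup like this would typically also tell the NameNodes where the JournalNodes live and tell clients how to locate the active NameNode. The following is a minimal sketch rather than a complete configuration: the JournalNode hostnames are hypothetical placeholders that do not come from the text, and 8485 is assumed to be the JournalNode RPC port in use.

```xml
<!-- Sketch only: the JournalNode hostnames below are hypothetical placeholders -->
<property>
  <!-- Shared edits location: a quorum of JournalNodes, with the nameservice as the journal ID -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://journalnode1.example.com:8485;journalnode2.example.com:8485;journalnode3.example.com:8485/mycluster</value>
</property>
<property>
  <!-- Class that HDFS clients use to find the currently active NameNode of mycluster -->
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With these in place, clients address the cluster by its nameservice (mycluster) instead of a specific NameNode host, so a failover to any of nn1, nn2, or nn3 should be transparent to them.
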
The preceding configurations are just a small snippet of what an HA configuration with more than two NameNodes would look like in Hadoop 3. If you are looking for comprehensive configuration steps for HA, then you should refer to this link:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html.