上QQ阅读APP看书，第一时间看更新

Architecture of RM high availability

The Resouces Manager high availability also works on the Active/Standby architecture where the standby resource manager will take over control upon receiving the signal from the ZooKeeper. The following diagram shows the high level design of Resouces Manager high availability:

Components of Resouces Manager HA:

Resource Manager state store: As discussed previously, it is important to store the state of the Resource Manager. So in case of failure, the standby resource manager will reload the state from the store during startup and will start from the last execution point. The Resource Manager state store provides the ability to store the internal state of the Resource Manager such as application and its attempts, version, tokens, and so on. There is no need to store the cluster information as it will be reconstructed when the Node Manager sends a heartbeat to the new Resource Manager. It provides file-based and ZooKeeper based state store implementation.
Resource Manager restart and failover: The Resource Manager loads the internal application state from the Resource Manager state store. The scheduler of the Resource Manager reconstructs its state of cluster information when the Node Manager sends heartbeats. The Resource Manager makes a re-attempt for an application submitted to the failed resource manager. The checkpoint process allows the Resource Manager to only work on failed, running, or pending tasks and avoids restarting already completed task that helps to save significant time cost.
Failover fencing: In high availability YARN clusters, there can be two or more than two Resource Managers in active/standby mode. It is possible at times that two managers assume themselves as active and that will lead to a split brain situation. When that happens, both resource managers will control cluster resources and handle client requests. The failover fencing mechanism enables the active Resource Manager to restrict other resource managers' operation. The state store that we discussed previously provides the ZooKeeper based state store ZKResourceManagerStateStore, which only allows a single Resource Manager to write to it at a time. It does so by maintaining an ACL where only the active Resource Manager will have create-delete access and the other will only have read-admin access.
Leader elector: The ZooKeeper-based leader elector ActiveStandbyElector is used to elect a new active Resource Manager and implements fencing internally. When the current active Resource Manager goes down, a new resource manager will be elected by ActiveStandbyElector and take over the control. If automatic failover is not enabled, then admin has to manually make the transition of the active RM to standby and the other way around.