
Resource Manager high availability

The Resource Manager (RM) is a single point of failure in a YARN cluster because every client request goes through it. It also acts as the central authority that allocates resources to tasks. If the Resource Manager fails, YARN as a whole becomes unavailable: clients can neither obtain information about the cluster nor submit applications for execution. It is therefore important to make the Resource Manager highly available. The following are a few important considerations for high availability:

Resource Manager state: The Resource Manager state must be persisted, because state held only in memory is lost when the Resource Manager fails. If the state is still available after a failure, the Resource Manager can be restarted from the last saved state and resume from the point of failure.
Running application state: The persistent state store allows YARN to keep functioning across an RM restart in a manner that is transparent to users. In a basic restart, once the Resource Manager has loaded the last saved state, it kills all the running containers, restarts all the application masters, and starts the applications from a clean state. The work already done by the containers is lost, which increases application completion time. The state of the containers therefore also needs to be preserved, so that a Resource Manager failure does not require restarting the application masters or killing existing containers.
Automatic failover: Automatic failover refers to the transfer of control from a failed Resource Manager to a standby Resource Manager. A common way to implement it is with a failover controller that triggers the failover when a specified condition is met, together with a fencing mechanism that prevents the old Resource Manager from continuing to act as the active one. Remember that transferring control always requires the state of the old Resource Manager to be made available to the new one.
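
The three considerations above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not YARN's actual implementation: the StateStore and ResourceManager classes, the epoch counter used for fencing, and the application ID are all hypothetical names introduced for illustration. The sketch shows an active Resource Manager persisting state before acknowledging work, a standby recovering that state on failover, and the old Resource Manager being fenced off when it tries to write again.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """Hypothetical durable store shared by both RMs (illustration only)."""
    apps: dict = field(default_factory=dict)  # app_id -> status
    epoch: int = 0                            # bumped on every failover

    def write(self, writer_epoch: int, app_id: str, status: str) -> None:
        # Fencing: reject writes from an RM that no longer holds the active role.
        if writer_epoch != self.epoch:
            raise PermissionError("stale ResourceManager fenced off")
        self.apps[app_id] = status


class ResourceManager:
    """Hypothetical RM with an in-memory view that is lost on a crash."""

    def __init__(self, name: str, store: StateStore):
        self.name = name
        self.store = store
        self.epoch = -1  # -1 means this RM is in standby
        self.apps = {}

    def become_active(self) -> None:
        # Claim a new epoch, then rebuild the in-memory state from the store.
        self.store.epoch += 1
        self.epoch = self.store.epoch
        self.apps = dict(self.store.apps)

    def submit(self, app_id: str) -> None:
        # Persist first, so an accepted application survives an RM failure.
        self.store.write(self.epoch, app_id, "ACCEPTED")
        self.apps[app_id] = "ACCEPTED"


store = StateStore()
rm1 = ResourceManager("rm1", store)
rm2 = ResourceManager("rm2", store)

rm1.become_active()
rm1.submit("app_0001")

# rm1 is presumed failed; the standby takes over and recovers the old state.
rm2.become_active()
print(rm2.apps)

# If rm1 was only unreachable and comes back, its writes are rejected.
try:
    rm1.submit("app_0002")
except PermissionError as exc:
    print("rm1:", exc)
```

In this sketch the shared store is what makes the failover possible: the new active Resource Manager never talks to the failed one, it only reads what was persisted. The epoch check stands in for fencing, so that two Resource Managers cannot both modify the state at the same time.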