Mastering Hadoop 3

Architecture

YARN stands for Yet Another Resource Negotiator, and was introduced with Apache Hadoop 2.0 to address the scalability and manageability issues that existed in the previous versions. In Hadoop 1.0, there were two major components for job execution: the JobTracker and the task trackers. The JobTracker was responsible for managing resources and scheduling jobs. It was also responsible for tracking the status of each job and restarting it on failure. The task trackers were responsible for running tasks and sending progress reports to the JobTracker. The JobTracker also rescheduled failed tasks on different task trackers. Because the JobTracker could easily be overloaded by all of these responsibilities, Hadoop 2.0 made several changes in its architecture to eliminate the following limitations of Hadoop 1.0:

  • Scalability: In Hadoop 1.0, the JobTracker is responsible for scheduling the jobs, monitoring each job, and restarting them on failure. This means the JobTracker spends the majority of its time managing the life cycle of applications. In a larger cluster with more nodes and more tasks, the burden of scheduling and monitoring increases. This overhead limited the scalability of Hadoop version 1 to 4,000 nodes and 40,000 tasks.
  • High availability: High availability ensures that, even if the node serving requests goes down, a standby node can take over the responsibilities of the failed node. For this to work, the state of the active node must be kept in sync with the standby node. The JobTracker is a single point of failure. Every few seconds, the task trackers send information about their tasks to the JobTracker, and this large number of state changes in a very short span of time makes it difficult to implement high availability for the JobTracker.
  • Memory utilization: Hadoop version 1 required preconfigured task tracker slots for map and reduce tasks. A slot reserved for map tasks cannot be used for reduce tasks, or the other way around. Efficient utilization of the task trackers' memory was not possible because of this setup.
  • Non-MapReduce jobs: Every job in Hadoop version 1 had to be expressed as MapReduce, because scheduling was only possible through the JobTracker, and the JobTracker and the task trackers were tightly coupled with the MapReduce framework. As the adoption of Hadoop grew quickly, new requirements emerged, such as graph processing and real-time analytics, that needed to run over the same HDFS storage in order to reduce complexity, infrastructure, and maintenance costs.

Let us discuss the YARN architecture and see how it helps to resolve the limitations discussed previously. The initial idea of YARN was to split the resource management and job scheduling responsibilities of the JobTracker. YARN consists of two major components: the Resource Manager and the Node Manager. The Resource Manager is the master daemon responsible for managing the resources of the cluster. A per-application Application Master, running in a Node Manager container, is responsible for launching and monitoring the containers of its job. The cluster consists of one Resource Manager and multiple Node Managers, as seen in the following diagram:

The preceding diagram can be explained as follows:

  • Resource Manager: The Resource Manager is a master daemon that is responsible for managing the resources of submitted applications. It has two primary components:
    • Scheduler: The job of the Resource Manager Scheduler is to allocate the resources requested by the per-application Application Master. The Scheduler only schedules jobs; it does not monitor any task and is not responsible for relaunching failed application containers. The application sends YARN a scheduling request with detailed resource requirements, including the amount of memory required for the job, and upon receiving the request the Scheduler simply allocates the requested resources.
    • Application Manager: The job of the Application Manager is to manage the per-application Application Masters. Each application submitted to YARN has its own Application Master, and the Application Manager keeps track of each of them. Every client request for job submission is received by the Application Manager, which provides the resources needed to launch the Application Master for that application (a minimal client-side submission sketch follows this list). It also destroys the Application Master once the application has finished executing. When cluster resources are limited and already in use, the Resource Manager can request resources back from a running application so that they can be allocated to applications that need them.
  • Node Manager: The Node Manager is a slave daemon that runs on every worker node of the cluster and is responsible for launching and executing containers based on instructions from the Resource Manager. The Node Manager sends heartbeat signals to the Resource Manager, which also carry other information, including the Node Manager's machine details and its available memory. The Resource Manager updates its view of each Node Manager on receiving these heartbeats, which helps it plan and schedule upcoming tasks. The containers are launched on the Node Managers, and the Application Master itself also runs in a Node Manager container.
  • Application Master: The first step of an application is to submit its job to YARN; upon receiving the job submission request, YARN's Resource Manager launches the Application Master for that particular job in a container on one of the Node Managers. The Application Master is then responsible for managing the application's execution in the cluster. For each application, there is a dedicated Application Master running in some Node Manager container that coordinates between the Resource Manager and the Node Managers in order to complete the execution of the application. The Application Master requests the resources required for application execution from the Resource Manager, and the Resource Manager sends back detailed information about the allocated resource containers; the Application Master then coordinates with the respective Node Managers to launch the containers and execute the application's tasks. The Application Master sends heartbeats to the Resource Manager at regular intervals, updates its resource usage, and changes its plan of execution depending on the responses it receives from the Resource Manager (see the Application Master sketch that follows this list).
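
To make the interaction between the client, the Application Manager, and the Scheduler more concrete, the following is a minimal sketch that uses YARN's public YarnClient API to submit an application. It is an outline of the submission flow under assumptions rather than a production client: the application name, queue, memory size, number of virtual cores, and the placeholder echo command are illustrative values, and a real submission would ship an Application Master jar via HDFS instead:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmissionSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the Resource Manager using the cluster configuration on the classpath.
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // The Application Manager hands out a new application ID for this submission.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("yarn-architecture-demo"); // illustrative name

    // Describe the container that will run the Application Master.
    // A real Application Master would be a jar shipped via HDFS; echo is a placeholder.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "echo 'placeholder application master' 1>"
            + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"));
    appContext.setAMContainerSpec(amContainer);

    // Resource requirements the Scheduler uses to place the Application Master on a
    // Node Manager: memory in MB and virtual cores (illustrative values).
    appContext.setResource(Resource.newInstance(512, 1));
    appContext.setQueue("default");

    // Submit: the Resource Manager launches the Application Master
    // inside a container on one of the Node Managers.
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);

    yarnClient.stop();
  }
}

Once submitApplication returns, the Application Manager has accepted the job and the Scheduler places the Application Master container on a Node Manager as cluster capacity allows.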
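
The Application Master side of this protocol can be sketched with the AMRMClient and NMClient libraries. The sketch below is a simplified outline under the assumption that it is launched by YARN inside the Application Master's own container; the 256 MB, single-vcore request and the sleep command are illustrative values, not values from this chapter:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class ApplicationMasterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // Register this Application Master with the Resource Manager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // Client used to ask Node Managers to launch the containers we are allocated.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Ask the Scheduler for one container: 256 MB and 1 virtual core (illustrative values).
    Resource capability = Resource.newInstance(256, 1);
    Priority priority = Priority.newInstance(0);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Heartbeat the Resource Manager until the container is allocated, then
    // coordinate with the owning Node Manager to launch a placeholder command in it.
    boolean launched = false;
    while (!launched) {
      AllocateResponse response = rmClient.allocate(0.0f);
      for (Container container : response.getAllocatedContainers()) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("sleep 10"));
        nmClient.startContainer(container, ctx);
        launched = true;
      }
      Thread.sleep(1000);
    }

    // Report completion and unregister from the Resource Manager.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    nmClient.stop();
    rmClient.stop();
  }
}

In practice, frameworks such as MapReduce and Spark ship their own Application Masters that implement this registration, allocation, and launch cycle, so application developers rarely need to write one by hand.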