Mastering Hadoop 3

Points to remember

We have covered HDFS in detail. The following are a few points to remember:

  • HDFS consists of two main components: NameNode and DataNode. NameNode is a master node that stores metadata information, whereas DataNodes are slave nodes that store file blocks.
  • The Secondary NameNode is responsible for performing checkpoint operations, in which edit log changes are applied to the fsimage. It is also known as a checkpoint node.
  • Files in HDFS are split into blocks and blocks are replicated across a number of DataNodes to ensure fault tolerance. The replication factor and block size are configurable.
  • HDFS Balancer is used to distribute data evenly across all DataNodes. It is good practice to run the balancer whenever a new DataNode is added and to schedule a job that runs it at regular intervals.
  • In Hadoop 3, a high-availability setup can run more than two NameNodes at a time. If the active NameNode fails, one of the standby NameNodes is elected to become the new active NameNode.
  • Quorum Journal Manager writes namespace modifications to multiple JournalNodes. The standby NameNodes then read these changes and apply them to their own fsimage.
  • Erasure coding is a new feature introduced in Hadoop 3 that brings storage overhead down to no more than 50%. By comparison, the default replication factor of 3 costs 200% extra space. Erasure coding provides the same durability guarantee using less disk storage.
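
The arithmetic behind the block and storage-overhead points above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function names are hypothetical, the 128 MiB block size is the Hadoop 3 default for dfs.blocksize, and the (6 data, 3 parity) layout is assumed because it matches the default Reed-Solomon erasure coding policy.

```python
# Illustrative arithmetic only; none of these helpers are part of any Hadoop API.

MIB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MIB  # Hadoop 3 default for dfs.blocksize


def block_count(file_size_bytes, block_size_bytes=DEFAULT_BLOCK_SIZE):
    """Number of blocks a file is split into (the last block may be partial)."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-file_size_bytes // block_size_bytes)


def replication_overhead_percent(replication_factor):
    """Extra space used by replication, as a percentage of the logical data size."""
    return (replication_factor - 1) * 100.0


def erasure_coding_overhead_percent(data_units, parity_units):
    """Extra space used by erasure coding, as a percentage of the logical data size."""
    return parity_units / data_units * 100.0


if __name__ == "__main__":
    one_gib = 1024 * MIB
    print("Blocks in a 1 GiB file:", block_count(one_gib))
    print("Overhead at replication factor 3:", replication_overhead_percent(3), "%")
    print("Overhead with 6 data + 3 parity units:", erasure_coding_overhead_percent(6, 3), "%")
```

With these assumptions, a replication factor of 3 stores two extra copies of every block (200% overhead), while six data units protected by three parity units add only half the original size (50% overhead).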