
Points to remember
We have covered HDFS in detail. The following are a few points to remember:
- HDFS consists of two main components: NameNode and DataNode. The NameNode is the master node that stores file system metadata, whereas DataNodes are slave nodes that store the file blocks.
- The Secondary NameNode performs checkpoint operations, in which edit log changes are merged into the fsimage. A node in this role is also referred to as a checkpoint node.
- Files in HDFS are split into blocks and blocks are replicated across a number of DataNodes to ensure fault tolerance. The replication factor and block size are configurable.
- The HDFS Balancer redistributes data evenly across all DataNodes. It is good practice to run the balancer whenever a new DataNode is added, and to schedule a job that runs it at regular intervals.
- In Hadoop 3, a high-availability cluster can run more than two NameNodes at a time. If the active NameNode fails, one of the standby NameNodes is elected to become the new active NameNode.
- The Quorum Journal Manager writes namespace modifications to multiple JournalNodes. The Standby NameNodes read these changes and apply them to their own copy of the namespace.
- Erasure coding is a new feature introduced in Hadoop 3 that keeps storage overhead to no more than 50%. By comparison, replication with a factor of 3 costs 200% extra space, because every block is stored two additional times. Erasure coding provides a comparable durability guarantee using less disk storage.
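
The storage figures in the points above can be checked with a little arithmetic. The following is a minimal sketch, not part of any Hadoop API: the function names are hypothetical, and the 128 MiB block size, the replication factor of 3, and the 6 data plus 3 parity layout are assumed values chosen for illustration.

```python
import math

# Assumed values for illustration only; both are configurable in HDFS.
ASSUMED_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB
ASSUMED_REPLICATION = 3


def block_count(file_size: int, block_size: int = ASSUMED_BLOCK_SIZE) -> int:
    """Number of blocks a file of file_size bytes is split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    return math.ceil(file_size / block_size)


def replication_overhead(replication_factor: int = ASSUMED_REPLICATION) -> float:
    """Extra storage, as a fraction of the raw data size, under replication."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # One copy is the data itself; every further copy is overhead.
    return float(replication_factor - 1)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage, as a fraction of the raw data size, under erasure coding."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("data_units must be >= 1 and parity_units must be >= 0")
    # Only the parity units are extra; the data units hold the file itself.
    return parity_units / data_units


if __name__ == "__main__":
    one_gib = 1024 * 1024 * 1024
    print(block_count(one_gib))                    # 8
    print(f"{replication_overhead(3):.0%}")        # 200%
    print(f"{erasure_coding_overhead(6, 3):.0%}")  # 50%
```

With these assumptions, three-way replication stores two extra copies of every block (200% overhead), while a layout of six data units and three parity units adds only half the raw data size (50% overhead).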