Mastering Hadoop 3

Managing disk-skewed data in Hadoop 3.x

When you run a production Hadoop cluster over any period of time, you will inevitably need to manage the disks on your DataNodes. You may have to replace corrupted disks, or add disks to provide more data volume. Disk volumes may also vary in size within the same DataNode. All of these cases result in uneven data distribution across the disks of a DataNode. Round robin-based disk writes combined with random deletes can also skew the distribution over time.

Prior to the release of Hadoop 3, Hadoop administrators worked around such problems with methods that were far from ideal. One solution was to shut down the DataNode and use the UNIX mv command to move block replicas, along with their metadata files, from a directory on one disk to a directory on another. You had to ensure that the subdirectory names did not change; otherwise, upon rebooting, the DataNode would not be able to locate the block replicas. This is cumbersome and impractical if you have a very large Hadoop cluster. What you actually need is a disk balancer tool that performs these operations for you automatically, and that also gives you a complete picture of disk usage, showing how much of each disk is occupied on every DataNode. With those concerns in mind, the Hadoop community introduced the DataNode diskbalancer tool, which has the following abilities:
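The old manual approach might look something like the following sketch. All paths and block names here are placeholders for illustration; the exact directory layout depends on your dfs.datanode.data.dir configuration:

```shell
# Pre-Hadoop 3 manual rebalancing (illustrative sketch; example paths only).
# 1. Stop the DataNode so no blocks change while files are moved.
hadoop-daemon.sh stop datanode

# 2. Move a block replica AND its metadata file from a full disk to an
#    emptier one, preserving the same subdirectory structure under current/
#    so the DataNode can find the replica again after restart.
mv /data/disk1/dfs/dn/current/BP-xxxx/current/finalized/subdir0/blk_1073741825 \
   /data/disk3/dfs/dn/current/BP-xxxx/current/finalized/subdir0/
mv /data/disk1/dfs/dn/current/BP-xxxx/current/finalized/subdir0/blk_1073741825_1001.meta \
   /data/disk3/dfs/dn/current/BP-xxxx/current/finalized/subdir0/

# 3. Restart the DataNode.
hadoop-daemon.sh start datanode
```

Doing this by hand for thousands of blocks across hundreds of DataNodes is exactly the tedium the diskbalancer tool eliminates.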

  • Disk data distribution report: The HDFS diskbalancer tool generates reports that identify DataNodes suffering from asymmetric data distribution. Two types of report can be generated: one lists the top nodes with possible data skew that would benefit from running the tool, while the other gives detailed information about specific DataNodes. Node IPs or DNS names can be passed in directly or supplied in a file as an argument.
  • Performing disk balancing on live DataNodes: This is the core functionality of the tool: it moves data block folders from one volume to another. It works in three phases: discover, plan, and execute. The discover phase gathers cluster information, such as the physical layout of the cluster's nodes and their storage types. The plan phase takes input from the discover phase and determines, for each user-specified DataNode, which steps should be performed, how data should be moved, and in what sequence. The execute phase runs the plans received from the plan phase on each DataNode; this happens in the background without affecting user activity.
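The plan and execute phases above map directly onto diskbalancer subcommands. A minimal sketch of the workflow follows; the hostname and the timestamped plan path are placeholders that you would replace with values from your own cluster:

```shell
# Plan phase: compute the moves needed to even out this DataNode's disks.
# (datanode1.example.com is a placeholder hostname.)
hdfs diskbalancer -plan datanode1.example.com

# The plan command prints where the plan file was written (typically under
# /system/diskbalancer/<timestamp>/ in HDFS). Execute phase: submit that
# plan to the DataNode, which carries it out in the background.
hdfs diskbalancer -execute /system/diskbalancer/<timestamp>/datanode1.example.com.plan.json

# Check the progress of a running plan at any time.
hdfs diskbalancer -query datanode1.example.com
```

A running plan can also be stopped with the -cancel subcommand if it was submitted by mistake.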

After executing a balancing activity, the HDFS diskbalancer tool generates two reports per DataNode for debugging and verification purposes: one called <datanode>.before.json and the other called <datanode>.after.json. These capture the disk storage state of each DataNode before and after the tool was run. You can compare the two reports to decide whether to re-run the balancer or whether the distribution is already sufficient at that point in time. The following table presents some of the commands that can be used to run hdfs diskbalancer:

The preceding table covers some of the diskbalancer commands at a high level. If you need more details about the hdfs diskbalancer commands, check out the following link:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html.
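As an illustration, the reporting side of the tool can be driven as follows. The hostname is a placeholder, and the top-node count is an arbitrary example value:

```shell
# List the top 5 DataNodes that would benefit most from disk balancing
# (i.e., those with the most skewed data distribution).
hdfs diskbalancer -report -top 5

# Detailed per-volume report for a specific DataNode; multiple nodes can
# be given as a comma-separated list or supplied in a file.
hdfs diskbalancer -report -node datanode1.example.com
```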