
Metadata management
HDFS stores a large amount of structured and unstructured data in various formats. As this data grows to terabytes and petabytes and is used by Hadoop, you are likely to come across questions such as: what data is available on HDFS, how is it being used, what types of users are accessing it, and when was it created? Well-maintained metadata can effectively answer these questions and thus improves the usability of the data stored on HDFS.
The NameNode keeps the complete fsimage in memory so that all metadata requests can be served in the shortest possible time, and it persists the fsimage and edit logs on disk. The fsimage contains HDFS directory information, file information, permissions, quotas, last access and last modification times, and block IDs for files.
HDFS metadata includes various attributes of directories and files, such as ownership, permissions, quotas, replication factors, and much more. This information is stored in two types of files:
- fsimage: An fsimage file contains the complete state of the File System, where every File System modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the File System state up to a specific transaction ID.
Let's see how we can analyze the content of fsimage by looking at various usage patterns that can help us check the health of the File System. The following command can be used to fetch the latest fsimage from the NameNode:
hdfs dfsadmin -fetchImage /home/packt
The fsimage file is not in a human-readable format. The Offline Image Viewer (oiv) tool is used to convert fsimage content into a human-readable format. It can also expose a read-only WebHDFS API, which helps with offline fsimage analysis. Let's see how we can use this tool and what options are available:
hdfs oiv --help
The preceding command will return the following output, which contains details about its usage and options:

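Besides converting the file, the Offline Image Viewer can serve an fsimage over a read-only WebHDFS endpoint so that you can browse it with the usual hdfs dfs commands. The following is only a sketch; it assumes the WebImageViewer processor and its -addr option are available in your Hadoop version, and the port 5978 is simply an illustrative choice:
hdfs oiv -p WebImageViewer -addr 127.0.0.1:5978 -i /home/packt/fsimage_00000000007685439
hdfs dfs -ls webhdfs://127.0.0.1:5978/
The second command lists the root of the offline image as if it were a live File System, without putting any load on the running NameNode.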
We will now convert the fsimage content into a tab-delimited file:
hdfs oiv -i /home/packt/fsimage_00000000007685439 -o /home/packt/fsimage_output.csv -p Delimited
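Before loading this file anywhere, it can be useful to glance at the first couple of lines to confirm that the export worked and to see the header row written by the Delimited processor. This is just an optional sanity check using a standard Linux utility:
head -2 /home/packt/fsimage_output.csv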
Now we have the information stored in fsimage available as a tab-delimited file, and we can expose a Hive table on top of it after removing the header from the file. The header can be removed with a Linux utility such as sed, or with any file editor. You can use the following command to do this:
sed -i -e "1d" /home/packt/fsimage_output.csv
Once the header has been removed, we can expose a Hive table on top of this fsimage extract:
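The exact column layout of the Delimited output can vary between Hadoop versions, so the following statements are only a sketch. They assume the commonly produced columns (Path, Replication, ModificationTime, AccessTime, PreferredBlockSize, BlocksCount, FileSize, NSQUOTA, DSQUOTA, Permission, UserName, GroupName) and an illustrative table name, fsimage_info:
CREATE TABLE fsimage_info (
  path STRING,
  replication INT,
  modification_time STRING,
  access_time STRING,
  preferred_block_size BIGINT,
  blocks_count INT,
  file_size BIGINT,
  ns_quota STRING,
  ds_quota STRING,
  permission STRING,
  user_name STRING,
  group_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/packt/fsimage_output.csv' INTO TABLE fsimage_info;
Once the table is loaded, simple queries can answer the usage questions we raised earlier, for example, counting how many files each user owns:
SELECT user_name, COUNT(*) AS file_count FROM fsimage_info GROUP BY user_name ORDER BY file_count DESC;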
- edits: An edits log file contains a list of changes that were applied to the File System after the most recent fsimage. The edit log contains an entry for each operation, and the checkpoint operation periodically merges fsimage and the edit log by applying all of the changes available in the edit logs to fsimage before saving the new fsimage.
The edit log file is stored in a binary format, and we can convert it into a human-readable XML format using the Offline Edits Viewer (oev) tool, as follows:
sudo hdfs oev -i /hadoop/hdfs/namenode/current/edits_0000000000000488053-0000000000000488074 -o editlog.xml
After running the preceding command, we can open editlog.xml to see the content of the edit log, which consists of a number of different attributes:

A new record entry is made for every new operation. The structure of the record entry is as follows:
<RECORD>
  <OPCODE>OP_ADD</OPCODE>
  <DATA>
    <TXID>488055</TXID>
    <LENGTH>0</LENGTH>
    <INODEID>190336</INODEID>
    <PATH>/tmp/hive/hive/124dd7e2-d4d3-413e-838e-3dbbbd185a69/inuse.info</PATH>
    <REPLICATION>3</REPLICATION>
    <MTIME>1509663411169</MTIME>
    <ATIME>1509663411169</ATIME>
    <BLOCKSIZE>134217728</BLOCKSIZE>
    <CLIENT_NAME>DFSClient_NONMAPREDUCE_1006023362_1</CLIENT_NAME>
    <CLIENT_MACHINE>10.1.2.26</CLIENT_MACHINE>
    <OVERWRITE>true</OVERWRITE>
    <PERMISSION_STATUS>
      <USERNAME>hive</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>420</MODE>
    </PERMISSION_STATUS>
    <RPC_CLIENTID>ad7a6982-fde8-4b8a-8e62-f9a04c3c228e</RPC_CLIENTID>
    <RPC_CALLID>298220</RPC_CALLID>
  </DATA>
</RECORD>
Here, OPCODE represents the type of operation performed on the file available at the PATH location.
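If you want a quick summary of which operations dominate an edit log, you can count the opcodes in the generated editlog.xml with a simple shell pipeline. This is just a convenience sketch using standard Linux utilities:
grep -o '<OPCODE>[^<]*</OPCODE>' editlog.xml | sort | uniq -c | sort -rn
Each line of the output shows how many times a given operation, such as OP_ADD, appears in that edit log segment.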
Now, we will see how the checkpoint operation works and what steps are involved in it.