
Lazy persist writes in HDFS
Enterprise adoption of Hadoop is growing day by day, and with it the variety of application types that use Hadoop for their enterprise goals. One such use case involves applications that deal with data amounting to only a few GBs. For such small data sets, disk I/O on the write path adds latency that works against performance goals, especially when the data can easily fit into memory without any disk I/O at all. With the release of Hadoop 2.6, support was introduced for writes that use the off-heap memory of DataNodes; data held in memory is later flushed to disk asynchronously. This removes expensive disk I/O and checksum computation from write operations initiated by the HDFS client. Such asynchronous writes are called lazy persist writes: persistence to disk does not happen immediately, but asynchronously some time later.
HDFS provides a best-effort guarantee against data loss for these writes, but loss is still possible in rare cases, for example when a DataNode restarts before in-memory replicas have been persisted to disk. The risk can be reduced by avoiding lazy persist writes for some time before a planned restart, but there is never an absolute guarantee against data loss due to restarts. For precisely this reason, this feature should only be used for data that is temporary in nature and can be regenerated by rerunning the operation that produced it.
Another important aspect of lazy persist writes is that the file should be configured with a single replica. If multiple replicas are enabled for a file, a write operation cannot complete until all replicas have been written to different DataNodes, which defeats the low-latency purpose of memory writes because replication involves multiple data transfers over the network. If needed, replication can be moved off the hot write path by raising the replication factor later (possibly asynchronously) once writes are complete; however, there is a chance of data loss if a disk fails before that replication has finished.
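To illustrate, the following sketch in Java uses the standard Hadoop FileSystem API to request lazy persist behavior by passing the CreateFlag.LAZY_PERSIST flag when creating a file. The path, buffer size, and block size shown are illustrative assumptions rather than recommended values:

import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class LazyPersistWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative values: temporary, regenerable data with a single replica.
        Path path = new Path("/tmp/lazy-persist-demo.dat");
        short replication = 1;               // lazy persist is intended for single-replica files
        int bufferSize = 4096;
        long blockSize = 128 * 1024 * 1024L;

        // CreateFlag.LAZY_PERSIST asks the DataNode to write the block to memory first
        // and flush it to disk asynchronously.
        try (FSDataOutputStream out = fs.create(
                path,
                FsPermission.getFileDefault(),
                EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST),
                bufferSize,
                replication,
                blockSize,
                null)) {
            out.write("temporary, regenerable data".getBytes("UTF-8"));
        }
    }
}

Alternatively, the LAZY_PERSIST storage policy can be applied to a directory from the command line with hdfs storagepolicies -setStoragePolicy -path <path> -policy LAZY_PERSIST, so that files created under it use the policy without code changes.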
Generally, to set up lazy persist optimization in Hadoop, you need to set up RAM disks. A RAM disk is a virtual drive backed by RAM: at first glance it looks like a regular drive on your PC, but it sets aside a fixed amount of RAM that is then unavailable to other processes. RAM disks were chosen for Hadoop's memory storage support because they offer better persistence in the event of a DataNode restart; RAM disks have provisions for automatically saving their content to the hard drive before a restart.
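As a minimal sketch of such a setup, assuming a tmpfs mount point of /mnt/dn-tmpfs and a size of 4 GB (both example values), the RAM disk is mounted on the DataNode host and then tagged with the RAM_DISK storage type in dfs.datanode.data.dir in hdfs-site.xml (the /grid/0 and /grid/1 entries stand in for whatever local disk directories are already configured):

sudo mount -t tmpfs -o size=4g tmpfs /mnt/dn-tmpfs/

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/grid/0,/grid/1,[RAM_DISK]/mnt/dn-tmpfs</value>
</property>

The amount of memory the DataNode may use for in-memory replicas is bounded by dfs.datanode.max.locked.memory, so that setting should be sized together with the RAM disk.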
For more details on the LAZY_PERSIST storage policy, refer to https://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html#Use_the_LAZY_PERSIST_Storage_Policy.