
The need for highly redundant storage
As the space available to applications grows, so does the demand placed on the storage behind it. Applications may require uninterrupted access to their data at all times; any disruption can put business continuity at risk. No company wants to deal with an outage, let alone an interruption in its core infrastructure that leads to lost revenue, unserved customers, and users unable to log in to their accounts, all because of poor design decisions.
Let's consider storing data on a traditional monolithic storage array. Doing so carries significant risk, because everything resides in a single place. A single massive array holding all of the company's information is an operational liability: the array is a single point of failure, and every type of hardware, no matter how good, eventually fails.
Monolithic arrays tend to handle failures by providing redundancy through traditional RAID at the disk level. While this works well for small local storage serving a few hundred users, it may not be a good idea at petabyte scale, where storage space and the number of active concurrent users grow drastically. In certain scenarios, a RAID recovery can bring the entire storage system down, or degrade performance to the point that the application no longer works as expected. Additionally, because disk capacities keep increasing while single-disk performance has stayed roughly flat over the past several years, recovering a single disk now takes substantially longer; rebuilding a 1 TB disk is not the same as rebuilding a 10 TB disk.
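To put rough numbers on this, a RAID rebuild has to read or reconstruct the entire disk, so the rebuild time is bounded below by the disk's capacity divided by its sustained throughput. Assuming an illustrative sustained rate of 200 MB/s (an assumption; real rates vary by drive model and drop further under concurrent production I/O):

    rebuild time ≈ capacity / sustained throughput
    1 TB disk:   1,000,000 MB / 200 MB/s =  5,000 s ≈ 1.4 hours
    10 TB disk: 10,000,000 MB / 200 MB/s = 50,000 s ≈ 14 hours

During those hours the array runs degraded, and a second disk failure can mean data loss, which is why ever-larger disks make traditional RAID recovery increasingly painful.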
Storage clusters, such as GlusterFS, handle redundancy differently, by providing methods that best fit the workload. For example, when using a replicated volume, data is mirrored from one node to another. If a node goes down, traffic is seamlessly directed to the remaining nodes, in a way that is entirely transparent to users. Once the problematic node is serviced, it can quickly be put back into the cluster, where it goes through self-healing of its data. In contrast to traditional storage, a storage cluster removes the single point of failure by distributing data across multiple members of the cluster.
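As a minimal sketch of what this looks like in practice, the following GlusterFS commands create and start a two-way replicated volume and then check the self-heal status after a node returns; the volume name (gvol0), node names (node1, node2), and brick path (/data/brick1) are placeholders chosen for illustration:

    gluster volume create gvol0 replica 2 node1:/data/brick1 node2:/data/brick1
    gluster volume start gvol0
    gluster volume heal gvol0 info

With replica 2, every file is stored on both bricks; if node2 is taken down for service, clients keep working against node1, and once node2 rejoins, the self-heal daemon copies across whatever changed while it was offline.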
This increased availability means that we can meet the application's service-level agreements (SLAs) and maintain the desired uptime.