
Distributed databases
Distributed filesystems, like conventional filesystems, are used to store files. In the case of distributed filesystems such as the HDFS, those files can be very large, but ultimately they remain just files. When data requires modeling, we need something more than a filesystem; we need a database.
Distributed databases, just like single-machine databases, allow us to model our data. Unlike single-machine databases, however, both the data and the data model itself span all the nodes in a cluster, which together act as a single logical database. This means that not only can we take advantage of the increased performance, throughput, fault tolerance, resilience, and cost-effectiveness offered by distributed systems, but we can also model our data and thereafter query it efficiently, no matter how large it is or how complex the processing requirements are. Depending on the type of distributed database, it may be deployed on top of a distributed filesystem (such as Apache HBase deployed on top of the HDFS) or run directly on the nodes' local storage.
In our big data ecosystem, it is often the case that distributed filesystems such as the HDFS are used to host data lakes. A data lake is a centralized data repository where data is persisted in its original raw format, such as files and object BLOBs. This allows organizations to consolidate their disparate raw data estate, including structured and unstructured data, into a central repository with no predefined schema, while offering the ability to scale over time in a cost-effective manner.
Thereafter, in order to actually deliver business value and actionable insight from this vast repository of schema-less data, data processing pipelines are engineered to transform this raw data into meaningful data conforming to some sort of data model that is then persisted into serving or analytical data stores typically hosted by distributed databases. These distributed databases are optimized, depending on the data model and type of business application, to efficiently query the large volumes of data held within them in order to serve user-facing business intelligence (BI), data discovery, advanced analytics, and insights-driven applications and APIs.
Examples of distributed databases include the following:
- Apache HBase: https://hbase.apache.org/
- Apache Cassandra: http://cassandra.apache.org/
- Apache CouchDB: http://couchdb.apache.org/
- Apache Ignite: https://ignite.apache.org/
- Greenplum Database: https://greenplum.org/
- MongoDB: https://www.mongodb.com/
Apache Cassandra is an example of a distributed database that employs a masterless architecture: there is no single point of failure, and it supports high throughput when processing huge volumes of data. In Cassandra, there is no master copy of the data. Instead, data is automatically partitioned, based on partitioning keys and other features inherent to how Cassandra models and stores data, and replicated across other nodes in the cluster according to a configurable replication factor. Since there is no master/slave hierarchy, a gossip protocol is employed so that the nodes in a Cassandra cluster can dynamically learn about the state and health of the other nodes.
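The ring-based placement described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual implementation: the node names are invented, and MD5 stands in for Cassandra's Murmur3 partitioner. Each node owns a position on a hash ring; a row's partition key is hashed onto the ring, and the row is replicated on the next `replication_factor` distinct nodes walking clockwise:

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy consistent-hashing ring with replication (illustrative only)."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Place each node at a deterministic position on the ring.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, partition_key):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        token = self._hash(partition_key)
        start = bisect_right(self.ring, (token, ""))
        picked = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in picked:
                picked.append(node)
            if len(picked) == self.replication_factor:
                break
        return picked

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))  # three distinct nodes responsible for this key
```

Because placement is purely a function of the key's hash and the ring layout, any node can compute where a given row lives without consulting a master, which is what makes the masterless design possible.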
To process read and write requests from a client application, Cassandra automatically elects a coordinator node from the available nodes in the cluster, a process that is invisible to the client. To process a write request, the coordinator node, based on the partitioning features of Cassandra's underlying distributed data model, contacts all applicable nodes where the write and its replicas should be persisted. To process a read request, the coordinator node contacts one or more of the replica nodes where it knows the data in question has been written, again based on Cassandra's partitioning features. The underlying architecture employed by Cassandra can therefore be visualized as a ring, as illustrated in Figure 1.4. Note that although the topology of a Cassandra cluster can be visualized as a ring, this does not mean that a failure in one node results in the failure of the entire cluster. If a node becomes unavailable for whatever reason, Cassandra simply continues to write to the other applicable nodes that should persist the requested data, while maintaining a queue of pending operations (hints) pertaining to the failed node. When the failed node is brought back online, Cassandra automatically replays these hints to bring it up to date:
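The coordinator's write path, including the queue of pending operations held for a failed node, can be sketched as follows. This is a minimal illustration with invented class and method names, not Cassandra's API: the coordinator sends each write to every replica, stores a hint for any replica that is down, and replays those hints when the node rejoins:

```python
class Node:
    """A storage node that may be temporarily offline."""
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

class Coordinator:
    """Routes writes to replicas, queuing hints for unavailable nodes."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = {}  # node name -> queued (key, value) writes

    def write(self, key, value):
        for node in self.replicas:
            if node.up:
                node.data[key] = value
            else:
                self.hints.setdefault(node.name, []).append((key, value))

    def replay_hints(self, node):
        """Called when a failed node rejoins: apply its queued writes."""
        for key, value in self.hints.pop(node.name, []):
            node.data[key] = value

a, b, c = Node("a"), Node("b"), Node("c")
coord = Coordinator([a, b, c])
b.up = False                 # node b goes offline
coord.write("k1", "v1")      # a and c persist the write; a hint is queued for b
b.up = True
coord.replay_hints(b)        # b catches up from its hint queue
print(b.data["k1"])          # v1
```

The key point is that the cluster keeps accepting writes while a node is down; availability is preserved, and the lagging replica converges once it returns.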
