Seven NoSQL Databases in a Week

How does Neo4j work?

Neo4j stores nodes, edges, and properties on disk in stores that are specific to each type; for example, nodes are stored in the node store.[5, s.11] They are also held in two types of cache: the file system (FS) cache and the node/relationship cache. The FS cache is divided into regions for each type of store, and data is evicted according to a least-frequently-used (LFU) policy.
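To make the eviction policy concrete, the following is a minimal sketch of an LFU cache in Python. It is an illustration only, not Neo4j's actual cache code; the `LFUCache` class, its methods, and the `node:N` keys are hypothetical names chosen for this example.

```python
from collections import Counter


class LFUCache:
    """Toy least-frequently-used cache (illustrative, not Neo4j's implementation)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.data = {}         # key -> cached value
        self.hits = Counter()  # key -> number of accesses

    def get(self, key):
        if key not in self.data:
            return None
        self.hits[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the entry with the fewest recorded accesses.
            victim = min(self.data, key=lambda k: self.hits[k])
            del self.data[victim]
            del self.hits[victim]
        self.data[key] = value
        self.hits[key] += 1


cache = LFUCache(capacity=2)
cache.put("node:1", {"name": "Alice"})
cache.put("node:2", {"name": "Bob"})
cache.get("node:1")                      # node:1 is now accessed more often than node:2
cache.put("node:3", {"name": "Carol"})   # the cache is full, so node:2 is evicted
```

A production cache would track frequencies more efficiently than a linear scan with `min`, but the eviction rule is the same: the least frequently accessed entry goes first.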

Data is written in transactions assembled from commands and sorted to obtain a predictable update order. Commands are sorted at the time of creation, with the aim of preserving consistency. Writes are added to the transaction log and either marked as committed or rolled back (in the event of a failure). Changes are then applied (in sorted order) to the store files on disk.
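The write path described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Neo4j's real transaction code: the `Command` and `Transaction` classes, the `commit` function, the store names, and the sort key (store name, then record ID) are all hypothetical.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class Command:
    store: str                          # hypothetical store name, e.g. "node"
    record_id: int
    value: str = field(compare=False)   # payload does not affect ordering


class Transaction:
    def __init__(self, tx_id):
        self.tx_id = tx_id
        self.commands = []

    def add(self, command):
        # Keep commands sorted as they are created (assumed sort key).
        bisect.insort(self.commands, command)


def commit(tx, log, stores):
    # 1. Append the transaction to the log.
    entry = {"tx_id": tx.tx_id, "commands": list(tx.commands), "state": "pending"}
    log.append(entry)
    # 2. Mark it as committed (a failure here would mark it "rolled_back").
    entry["state"] = "committed"
    # 3. Apply the changes to the stores in sorted order.
    for cmd in entry["commands"]:
        stores.setdefault(cmd.store, {})[cmd.record_id] = cmd.value


log, stores = [], {}
tx = Transaction(tx_id=1)
tx.add(Command("property", 7, "name=Alice"))
tx.add(Command("node", 3, "in use"))
commit(tx, log, stores)
```

Although the property command was added first, the node command is applied first, because the order comes from the sort key and not from the order of insertion.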

It is important to note that transactions in Neo4j dictate the state and are therefore idempotent by nature.[5, s.34] They do not directly modify the data. Reapplying transactions for a recovery event simply replays the transactions as of a given safe point.
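A short Python sketch may help show why state-dictating transactions are safe to replay. The log format, the `replay` function, and the `safe_point` parameter are assumptions made for illustration; they do not mirror Neo4j's on-disk log.

```python
def replay(log, stores, safe_point=0):
    """Reapply committed transactions whose ID is greater than safe_point."""
    for entry in log:
        if entry["state"] == "committed" and entry["tx_id"] > safe_point:
            for store, record_id, value in entry["commands"]:
                # Each command states the resulting value outright,
                # so applying it again leaves the store unchanged.
                stores.setdefault(store, {})[record_id] = value
    return stores


log = [
    {"tx_id": 1, "state": "committed", "commands": [("node", 1, "in use")]},
    {"tx_id": 2, "state": "rolled_back", "commands": [("node", 2, "in use")]},
    {"tx_id": 3, "state": "committed", "commands": [("property", 1, "name=Alice")]},
]

once = replay(log, {})
twice = replay(log, replay(log, {}))   # replaying a second time changes nothing
```

Had the commands been relative ("increment by one") instead of absolute ("set to this value"), a second replay would corrupt the data. That is the practical meaning of idempotence here.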

In a high-availability (HA), clustered scenario, Neo4j embraces a master/slave architecture. Transaction logs are then shared between all Neo4j instances, regardless of their current role. Unlike most master/slave implementations, slave instances can handle both reads and writes.[5, s.37] On a write transaction, the slave coordinates a lock with the master and buffers the transaction while it is applied to the master. Once that is complete, the buffered transaction is applied to the slave.
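The write flow through a slave can be modeled roughly as follows. This is a single-process sketch that stands in for a networked cluster; the `Master` and `Slave` classes and the use of `threading.Lock` as the coordinated lock are assumptions, not Neo4j's API.

```python
import threading


class Master:
    def __init__(self):
        self.lock = threading.Lock()   # stands in for the cluster-wide lock
        self.store = {}

    def apply(self, tx):
        self.store.update(tx)


class Slave:
    def __init__(self, master):
        self.master = master
        self.store = {}

    def write(self, tx):
        # Coordinate a lock with the master for the duration of the write.
        with self.master.lock:
            buffered = dict(tx)             # buffer the transaction locally
            self.master.apply(buffered)     # apply it to the master first
            self.store.update(buffered)     # then apply the buffered copy locally


master = Master()
slave = Slave(master)
slave.write({"node:1": "Alice"})
```

Because the master applies every write before the slave does, the master remains the authority on ordering even when the write arrives at a slave.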

Another important aspect of Neo4j's HA architecture is that each node/edge has its own unique identifier (ID). To accomplish this, the master instance allocates the IDs for each slave instance in blocks. The blocks are then sent to each instance so that IDs for new nodes/edges can be applied locally, preserving consistency, as shown in the following diagram:

A graphical representation of the CAP theorem, using the corners of a triangle to denote the design aspects of consistency, availability, and partition tolerance
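The block-based ID allocation described earlier can be sketched in Python as follows. The class names and the block size are hypothetical; the source does not state how large the blocks are in Neo4j.

```python
class IdAllocatorMaster:
    """Hands out non-overlapping ID blocks (block size is an assumed value)."""

    def __init__(self, block_size=1000):
        if block_size < 1:
            raise ValueError("block_size must be at least 1")
        self.block_size = block_size
        self.next_start = 0

    def allocate_block(self):
        start = self.next_start
        self.next_start += self.block_size
        return range(start, start + self.block_size)


class Instance:
    """A cluster member that assigns IDs locally from its current block."""

    def __init__(self, master):
        self.master = master
        self.ids = iter(())   # empty until the first block is requested

    def next_id(self):
        try:
            return next(self.ids)
        except StopIteration:
            # Block exhausted (or never fetched): ask the master for a new one.
            self.ids = iter(self.master.allocate_block())
            return next(self.ids)


master = IdAllocatorMaster(block_size=3)
a, b = Instance(master), Instance(master)

ids_a = [a.next_id()]                        # a receives block 0..2
ids_b = [b.next_id()]                        # b receives block 3..5
ids_a += [a.next_id() for _ in range(3)]     # a uses up its block, then receives 6..8
```

Each instance contacts the master only when its block runs out, so most new nodes and edges get their IDs without a network round trip, and no two instances can hand out the same ID.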

When looking at Neo4j within the context of Brewer's CAP theorem (formerly known as both Brewer's CAP principle and Brewer's CAP conjecture), its designation would be as a CP system.[3, p.1] It earns this designation because of its use of locking mechanisms to support Consistency (C) over multiple, horizontally scaled instances in a cluster. Its support for clustering multiple instances together indicates that Neo4j is also Partition tolerant (P).

While the Enterprise Edition of Neo4j does offer high-availability clustering as an option, only a limited number of instances can accept write operations. Despite the name, this configuration limits its ability to be considered highly available within the bounds of the CAP theorem.