Pattern matching and data retrieval_Hands-On Graph Analytics with Neo4j-QQ阅读男生中文轻小说网

上QQ阅读APP看书，第一时间看更新

Pattern matching and data retrieval

The full power of graph databases, and Neo4j in particular, lies in their ability to go from one node to another by following relationships in a super-fast way. In this section, we explain how to read data from Neo4j through pattern matching and hence take full advantage of the graph structure.

Pattern matching

Let's take the following query:

MATCH ()-[r]-()
RETURN r

When we write these kinds of queries, we are actually performing what is called pattern matching with graph databases. The following schema explains this concept:

In this scenario, we have a directed graph made of nodes with labels A or B. We are looking for the sequence A -> B. Pattern matching consists of moving a stencil along the graph and seeing which pairs of nodes and relationships are consistent with it. On the first iteration, both the node labels and the relationship direction matches the search pattern. But on the second and third iterations, the node labels are not the expected ones, and these patterns are rejected. On iteration four, the labels are right, but the relationship orientation is reversed, which makes the matching fail again. Finally, on the last iteration, the pattern is respected even if not drawn in the right order: we have a node A connected with an outbound relationship to a node B.

Test data

Let's first create some test data to experiment with. We'll use the states of the United States as a playground. A node in our graph will be a state, with its two-letter code, name, and rounded population as properties. Those states are connected when they share a common border through a relationship of type SHARE_BORDER_WITH:

Image credit: https://commons.wikimedia.org/wiki/File:Labelled_US_map.svg

Here is our sample data, created from the preceding image, using only states up to two degrees of separation away from Florida (FL):

CREATE (FL:State {code: "FL", name: "Florida", population: 21500000})
CREATE (AL:State {code: "AL", name: "Alabama", population: 4900000})
CREATE (GA:State {code: "GA", name: "Georgia", population: 10600000})
CREATE (MS:State {code: "MS", name: "Mississippi", population: 3000000})
CREATE (TN:State {code: "TN", name: "Tennessee", population: 6800000})
CREATE (NC:State {code: "NC", name: "North Carolina", population: 10500000})
CREATE (SC:State {code: "SC", name: "South Carolina", population: 5100000})


CREATE (FL)-[:SHARE_BORDER_WITH]->(AL)
CREATE (FL)-[:SHARE_BORDER_WITH]->(GA)
CREATE (AL)-[:SHARE_BORDER_WITH]->(MS)
CREATE (AL)-[:SHARE_BORDER_WITH]->(TN)
CREATE (GA)-[:SHARE_BORDER_WITH]->(AL)
CREATE (GA)-[:SHARE_BORDER_WITH]->(NC)
CREATE (GA)-[:SHARE_BORDER_WITH]->(SC)
CREATE (SC)-[:SHARE_BORDER_WITH]->(NC)
CREATE (TN)-[:SHARE_BORDER_WITH]->(MS)
CREATE (NC)-[:SHARE_BORDER_WITH]->(TN)

We will now use this data to understand graph traversal.

Graph traversal

Graph traversal consists of going from one node to its neighbors by following an edge (relationship) in a given direction.

Orientation

While relationships must be oriented when creating them, pattern matching can be performed by taking this orientation into account or not. The relationship between two nodes, a and b, can be of three kinds (with respect to a):

OUTBOUND: (a) -[r]->(b)
INBOUND:  (a)<-[r]- (b)
BOTH:     (a) -[r]- (b)

Our USA states graph is undirected, so we will only use the BOTH relationship syntax. For instance, let's find the direct neighbors of Florida and return their names:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN n.name

This leads to the following result:

╒═════════╕
│"n.name" │
╞═════════╡
│"Georgia"│
├─────────┤
│"Alabama"│
└─────────┘

It would be interesting to also see the state population, and order the result by this value, wouldn't it?

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN n.name as state_name, n.population as state_population
ORDER BY n.population DESC

Here's the corresponding result:

╒════════════╤══════════════════╕
│"state_name"│"state_population"│
╞════════════╪══════════════════╡
│"Georgia"   │10600000          │
├────────────┼──────────────────┤
│"Alabama"   │4900000           │
└────────────┴──────────────────┘

In this query, we are only interested in the direct neighbors of Florida, meaning only one hop from the starting node. But with Cypher, we can traverse more relationships.

The number of hops

For instance, if we also want the neighbors of the neighbors of Florida, we could use this:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(neighbor)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)
RETURN neighbor_of_neighbor

This returns six nodes. If you check the result carefully, you will possibly be surprised to realize that it contains Alabama, for example, which is a direct neighbor of Florida. That's true, but Alabama is also a neighbor of Tennessee, which is a neighbor of Florida, so Alabama is also a neighbor of a neighbor of Florida. If we only want the neighbors of neighbors that are not direct neighbors of Florida, we have to explicitly exclude them:

MATCH (FL:State {code: "FL"})-[:SHARE_BORDER_WITH]-(neighbor)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)
WHERE NOT (FL)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor) 
RETURN neighbor_of_neighbor

This time, the query returns only four results: South Carolina, North Carolina, Tennessee, and Mississippi.

Variable-length patterns

When the relationship is of the same type, as in our example, or, if we do not care about the relationship type, we can use the following shortcut:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH*2]-(neighbor_of_neighbor)
RETURN neighbor_of_neighbor

This will return the same six results that we have already seen in the previous section with this query:

(FL:State {code: "FL"})-[:SHARE_BORDER_WITH]-(neighbor)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)

You can give lower and upper values for the number of hops with this syntax:

[:SHARE_BORDER_WITH*<lower_value>..<upper_value>]

For instance, [:SHARE_BORDER_WITH*2..3] will return the neighbors with two or three degrees of separation.

It is even possible to use whatever path length using the * notation, like so:

[:SHARE_BORDER_WITH*]

This will match paths regardless of the number of relationships. However, this syntax is not recommended since it can create a huge performance decrease.

Optional matches

Some states do not share any borders with another US state. Let's add Alaska to our test graph:

CREATE (CA:State {code: "AK", name: "Alaska", population: 700000 })

In the case of Alaska, the query we wrote before to get the neighbors will actually return zero results:

MATCH (n:State {code: "AK"})-[:SHARE_BORDER_WITH]-(m)
RETURN n, m

Indeed, no pattern matches the sequence ("AK")-SHARE_BORDER_WITH-().

In some cases, we might want to see Alaska in the results anyway. For instance, knowing that Alaska has zero neighbors is information in itself. In that case, we would use OPTIONAL MATCH pattern matching:

MATCH (n:State {code: "AK"})
OPTIONAL MATCH (n)-[:SHARE_BORDER_WITH]-(m)
RETURN n.name, m.name

This query returns the following result:

╒════════╤════════╕
│"n.name"│"m.name"│
╞════════╪════════╡
│"Alaska"│null    │
└────────┴────────┘

The neighbor name, m.name, is NULL because no neighbor was found, but Alaska is part of the result.

We now have a better view of the way Cypher performs pattern matching. The next section will show how to perform aggregations such as count or sum, and handle lists of objects.