
Importing data from CSV or JSON
Even if your application uses Neo4j as its core database from the start, it is very likely that you will have to import some static data into your graph at some point. We will also need to perform that kind of operation throughout this book. In this section, we detail several ways of bulk-feeding Neo4j, with different tools and different input data formats.
Data import from Cypher
Cypher itself contains utilities to import data in CSV format from a local or remote file.
File location
Whether you import CSV, JSON, or another file format, the file can be located in one of the following places:
- Online, reachable through a public URL: 'http://example.com/data.csv'
- On your local disk: 'file:///data.csv'
Local file: the import folder
In the latter case, with the default Neo4j configuration, the file has to be in the import folder. Finding this folder is straightforward with Neo4j Desktop:
- Click on the Manage button on the graph you are interested in.
- Identify the Open folder button at the top of the new window.
- Click on the arrow next to this button and select Import.
This will open your file browser inside your graph import folder.
If you prefer the command line, instead of clicking on Open Folder, you can use the Open Terminal button. In my local Ubuntu installation, it opens a session whose working directory is as follows:
~/.config/Neo4j Desktop/Application/neo4jDatabases/database-c83f9dc8-f2fe-4e5a-8243-2e9ee29e67aa/installation-3.5.14
The path on your system will be different since you will have a different database ID and maybe a different Neo4j version.
This directory structure is as follows:
$ tree -L 1
.
├── bin
├── certificates
├── conf
├── data
├── import
├── lib
├── LICENSES.txt
├── LICENSE.txt
├── logs
├── metrics
├── NOTICE.txt
├── plugins
├── README.txt
├── run
└── UPGRADE.txt
10 directories, 5 files
Here are some notes about the content of this directory:
- data: Actually contains your data, especially data/databases/graph.db/ – the folder you can copy from one computer to another to retrieve your graph data.
- bin: Contains some useful executables such as the import tool we'll discuss in the next section.
- import: Put the files you want to import into your graph here.
- plugins: If you have installed the APOC plugin, you should see apoc-<version>.jar in this folder. All plugins are downloaded here, and if we want to add plugins not officially supported by Neo4j Desktop, it is enough to copy the JAR file into this directory.
Changing the default configuration to import a file from another directory
The default import folder can be configured by changing the dbms.directories.import parameter in the conf/neo4j.conf configuration file:
# This setting constrains all `LOAD CSV` import files to be under the `import` directory. Remove or comment it out to
# allow files to be loaded from anywhere in the filesystem; this introduces possible security problems. See the
# `LOAD CSV` section of the manual for details.
dbms.directories.import=import
CSV files
CSV files are imported using the LOAD CSV Cypher statement. Depending on whether you can/want to use the headers, the syntax is slightly different.
CSV files without headers
If your file does not contain column headers, or you prefer ignoring them, you can refer to columns by indexes:
LOAD CSV FROM 'path/to/file.csv' AS row
CREATE (:Node {name: row[1]})
Column indexes start at 0.
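For instance, with a hypothetical headerless file, people.csv, whose two columns hold a name and an age (the file name, label, and properties here are only for illustration), an import could look like this:
LOAD CSV FROM 'file:///people.csv' AS row
CREATE (:Person {name: row[0], age: toInteger(row[1])})
Each field is read as a string, hence the toInteger() conversion for the age column.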
CSV files with headers
However, in most cases, you will have a CSV file with named columns. In that case, it is much more convenient to use a column header as a reference instead of numbers. This is possible with Cypher by specifying the WITH HEADERS option in the LOAD CSV query:
LOAD CSV WITH HEADERS FROM '<path/to/file.csv>' AS row
CREATE (:Node {name: row.name})
Let's practice with an example. The usa_state_neighbors_edges.csv CSV file has the following structure:
code;neighbor_code
NE;SD
NE;WY
NM;TX
...
This can be explained as follows:
- code is the two-letter state identifier (for example, CO for Colorado).
- neighbor_code is the two-letter identifier of a state sharing a border with the current state.
Our goal is to create a graph where each state is a node, and we create a relationship between two states if they share a common border.
So, let's get started:
- Fields in this CSV file are delimited with semicolons (;), so we have to use the FIELDTERMINATOR option (the default is a comma).
- The first column contains a state code; we need to create the associated node.
- The last column also contains a state code, so we have to check whether this state already exists and create it if not.
- Finally, we can create a relationship between the two states, arbitrarily chosen to be oriented from the first state to the second one:
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (n:State {code: row.code})
MERGE (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)
This results in the graph displayed here, which is a representation of the United States:
It is interesting to note the special role of New York state, which completely splits the graph into two parts: states on one side of NY are never connected to a state from the other side of NY. Chapter 6, Node Importance, will describe the algorithms able to detect such nodes.
Our current graph structure has at least one problem: it does not contain states with no common borders, such as Alaska and Hawaii. To fix this issue, we will use another data file with a different format but that also contains the states without shared borders:
code;neighbors
CA;OR,NV,AZ
NH;VT,MA,ME
OR;WA,CA,ID,NV
...
AK;""
...
As you can see, we now have one row per state, containing the list of its neighbors. If the state does not have any neighbors, it is still present in the file, but the neighbors column is empty.
In reality, to avoid adding both a relationship between states A and B and a second relationship between states B and A, the neighbors column only contains the neighbors with name < state_name. That is why we have the row TX;"", even though we know that Texas does have neighbors.
The query to import this file can be written as follows:
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_all.csv" AS row FIELDTERMINATOR ';'
WITH row.code AS state, split(row.neighbors, ',') AS neighbors
MERGE (a:State {code: state})
WITH a, neighbors
UNWIND neighbors AS neighbor
WITH a, neighbor
WHERE neighbor <> ""
MERGE (b:State {code: neighbor})
CREATE (a)-[:SHARE_BORDER_WITH]->(b)
A few notes to better understand this query:
- We use the split() function to create a list from a comma-separated string of state codes (see the short example after these notes).
- The UNWIND operator creates one row for each element in the list of neighbor codes.
- We need to filter out the states with no neighbors from the rest of the query since Cypher cannot use a NULL value as an identifier when merging nodes. However, since the WHERE clause happens after the first MERGE, states without neighbors will still be created.
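To see what split() and UNWIND do independently of any file, you can run a minimal query on a literal value taken from the sample file:
WITH 'WA,CA,ID,NV' AS neighbors
UNWIND split(neighbors, ',') AS neighbor
RETURN neighbor
This returns four rows, one per state code.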
If you see an error or unexpected results when using LOAD CSV, you can debug by returning intermediate results, for instance:
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_all.csv" AS row FIELDTERMINATOR ';'
WITH row LIMIT 10
RETURN row
Using LIMIT is not mandatory, but it can improve performance when you are debugging against a very large file.
Eager operations
If you look closely, Neo4j Desktop is showing a small warning sign next to the query text editor. If you click on this warning, it will show an explanation about it. In our case, it says this:
The execution plan for this query contains the Eager operator, which forces all dependent data to be materialized in main memory before proceeding
This is not directly related to a data import, but this is often the first time we face this warning message, so let's try to understand it.
The Neo4j documentation defines the Eager operator as follows:
For isolation purposes, the operator ensures that operations affecting subsequent operations are executed fully for the whole dataset before continuing execution.
In other words, each statement of the query is executed for the whole input before moving on to the next statement. This is usually not a problem, since a typical Cypher query deals with a hundred rows or so, but when importing large data files, the overhead is noticeable and may even lead to OutOfMemory errors, so it needs to be taken into account.
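A related point for large imports: LOAD CSV is frequently combined with USING PERIODIC COMMIT so that rows are committed in batches and memory usage stays bounded, but a plan containing the Eager operator prevents this batching from behaving as intended. As a sketch, the batched form of a simple node-creation query over our edges file would be:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (:State {code: row.code})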
In the case of a data import, the Eager operator is used because we are using MERGE statements, which force Cypher to check whether the nodes and relationships already exist across the whole data file.
To overcome this issue, several solutions are possible, depending on the input data:
- If we are sure the data file does not contain duplicates, we can replace MERGE operations with CREATE.
- But most of the time, we will instead need to split the import statement into two or more parts.
The solution to load the US states would be to use three consecutive queries:
// first create starting state node if it does not already exist
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (:State {code: row.code})
// then create the end state node if it does not already exist
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (:State {code: row.neighbor_code})
// then create relationships
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MATCH (n:State {code: row.code})
MATCH (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)
The first two queries create the State nodes. If a state code appears several times in the file, the MERGE operation takes care of not creating two distinct nodes with the same code.
Once this is done, we read the same file again to create the neighborhood relationships: we start by reading the State nodes from the graph with MATCH operations and then create a unique relationship between them. Here again, we use MERGE rather than CREATE to prevent creating the same relationship twice between the same two nodes.
We had to split the first two statements into two separate queries because they are acting on the same node label. However, a statement like the following one will not rely on the Eager operator:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (:User {id: row.user_id})
MERGE (:Product {id: row.product_id})
Indeed, since the two MERGE statements involve two different node labels, Cypher does not have to execute all the operations for the first line to make sure there is no conflict with the second one; the operations are independent.
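If you want to check whether a query plan contains the Eager operator before running anything, you can prefix the query with EXPLAIN, which displays the execution plan without executing it:
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (:User {id: row.user_id})
MERGE (:Product {id: row.product_id})
No Eager operator should appear in the plan for this particular query.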
In the APOC utilities for imports section, we will study another representation of the US dataset, which we will be able to import without writing three different queries.
Before that, let's have a look at the built-in Neo4j import tool.
Data import from the command line
Neo4j also provides a command-line import tool. The executable is located at $NEO4J_HOME/bin/neo4j-admin. It requires several CSV files:
- One or several CSV file(s) for nodes with the following format:
id:ID,:LABEL,code,name,population_estimate_2019:int
1,State,CA,California,40000000
2,State,OR,Oregon,4000000
3,State,AZ,Arizona,7000000
- It is mandatory to have a unique identifier for nodes. This identifier is marked with the :ID keyword in the header.
- All fields are parsed as strings, unless the type is specified in the header with the :type syntax.
- One or several CSV file(s) for relationships with the following format:
:START_ID,:END_ID,:TYPE,year:int
1,2,SHARE_BORDER_WITH,2019
1,3,SHARE_BORDER_WITH,2019
Once the data files are created and located in the import folder, you can run the following:
bin/neo4j-admin import --nodes=import/states.csv --relationships=import/rel.csv
If you have very large files, the import tool can be much more convenient since it can manage compressed files (.tar, .gz, or .zip) and also understand header definitions in separate files, which makes it easier to open and update.
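For instance, when the header line lives in a separate file, each data source can be given as a comma-separated list in which the first entry is the header file (the file names below are hypothetical):
bin/neo4j-admin import \
    --nodes="import/states_header.csv,import/states.csv" \
    --relationships="import/rel_header.csv,import/rel.csv"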
The full documentation about the import tool can be found at https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/.
APOC utilities for imports
The APOC library is a Neo4j extension that contains several tools to ease working with this database:
- Data import and export: from and to different formats like CSV and JSON but also HTML or a web API
- Data structure: advanced data manipulation, including type conversion functions, maps, and collections management
- Advanced graph querying functions: tools to enhance pattern matching, including more conditions
- Graph projections: with virtual nodes and/or relationships
The first implementations of graph algorithms were done within this library, although they have now been deprecated in favor of a dedicated plugin that we will discover in Part 2 of this book.
In this section, we will only detail the tools related to data import, but I encourage you to take a look at the documentation to learn what can be achieved with this plugin: https://neo4j.com/docs/labs/apoc/current/.
When executing the code in the rest of this chapter, you may get an error saying the following:
There is no procedure with the name apoc.load.jsonParams registered for this database instance
If so, you will have to add the following line to your neo4j.conf file (the Settings tab in the Graph Management area in Neo4j Desktop):
dbms.security.procedures.whitelist=apoc.load.*
CSV files
The APOC library contains a procedure to import CSV files, apoc.load.csv, which yields each parsed row as a map. The syntax is the following:
CALL apoc.load.csv('<path/to/file.csv>') YIELD map AS row
CREATE (:Node {name: row.name, age: row.age})
As an exercise, try to import the USA state data with this procedure; one possible solution follows these remarks. Similar to the LOAD CSV statement, the file to be imported needs to be inside the import folder of your graph. However, you should not include the file:// prefix here, which would trigger an error.
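Here is one possible solution for the usa_state_neighbors_edges.csv file, assuming the semicolon delimiter is passed through the sep configuration option of apoc.load.csv:
CALL apoc.load.csv('usa_state_neighbors_edges.csv', {sep: ';'}) YIELD map AS row
MERGE (n:State {code: row.code})
MERGE (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)
Note that this query triggers the same Eager behavior discussed earlier; for a large file, you would split it as we did with LOAD CSV.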
JSON files
More importantly, APOC also contains a procedure to import data from JSON, which is not yet possible with vanilla Cypher. The structure of the query is as follows:
CALL apoc.load.json('http://...') YIELD value
UNWIND value.items AS item
CREATE (:Node {name: item.name})
As an example, we will import some data from GitHub using the GitHub API: https://developer.github.com/v3/.
We can get the list of repositories owned by the organization Neo4j with this request:
curl -u "<your_github_username>" https://api.github.com/orgs/neo4j/repos
Here is a sample of the data you can get for a given repository (only selected fields are shown):
{
"id": 34007506,
"node_id": "MDEwOlJlcG9zaXRvcnkzNDAwNzUwNg==",
"name": "neo4j-java-driver",
"full_name": "neo4j/neo4j-java-driver",
"private": false,
"owner": {
"login": "neo4j",
"id": 201120,
"node_id": "MDEyOk9yZ2FuaXphdGlvbjIwMTEyMA==",
"html_url": "https://github.com/neo4j",
"followers_url": "https://api.github.com/users/neo4j/followers",
"following_url": "https://api.github.com/users/neo4j/following{/other_user}",
"repos_url": "https://api.github.com/users/neo4j/repos",
"type": "Organization"
},
"html_url": "https://github.com/neo4j/neo4j-java-driver",
"description": "Neo4j Bolt driver for Java",
"contributors_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/contributors",
"subscribers_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/subscribers",
"commits_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/commits{/sha}",
"issues_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/issues{/number}",
"created_at": "2015-04-15T17:08:15Z",
"updated_at": "2020-01-02T10:20:45Z",
"homepage": "",
"size": 8700,
"stargazers_count": 199,
"language": "Java",
"license": {
"key": "apache-2.0",
"name": "Apache License 2.0",
"spdx_id": "Apache-2.0",
"node_id": "MDc6TGljZW5zZTI="
},
"default_branch": "4.0"
}
We will import this data into a new graph, using APOC. To do so, we have to enable file import with APOC by adding the following line to the Neo4j configuration file (neo4j.conf):
apoc.import.file.enabled=true
Let's now read this data. You can see the result of the apoc.load.json procedure with the following:
CALL apoc.load.json("neo4j_repos_github.json") YIELD value AS item
RETURN item
LIMIT 1
This query produces a result similar to the preceding JSON sample. To access the fields of each JSON object, we can use the item.<field> notation. So, here is how to create a node for each repository and its owner, and a relationship between the owner and the repository:
CALL apoc.load.json("neo4j_repos_github.json") YIELD value AS item
CREATE (r:Repository {name: item.name, created_at: item.created_at, contributors_url: item.contributors_url})
MERGE (u:User {login: item.owner.login})
CREATE (u)-[:OWNS]->(r)
Checking the content of the graph, we can see this kind of pattern:
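You can reproduce this view with a simple pattern match:
MATCH (u:User)-[o:OWNS]->(r:Repository)
RETURN u, o, r
LIMIT 10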
We can do the same to import all the contributors to the Neo4j repository:
CALL apoc.load.json("neo4j_neo4j_contributors_github.json")
YIELD value AS item
MATCH (r:Repository {name: "neo4j"})
MERGE (u:User {login: item.login})
CREATE (u)-[:CONTRIBUTED_TO]->(r)
Importing data from a web API
You may have noticed that the JSON returned by GitHub contains URLs to extend our knowledge about repositories or users. For instance, in the neo4j_neo4j_contributors_github.json file, each contributor has a followers_url. Let's see how to use APOC to feed the graph with the result of this API call.
Setting parameters
We can set parameters within Neo4j Browser with the following syntax:
:params {"repo_name": "neo4j"}
The parameters can then be referred to in later queries with the $repo_name notation:
MATCH (r:Repository {name: $repo_name}) RETURN r
This can be very useful when the parameter is used in multiple places in the query.
In the next section, we will perform HTTP requests against the GitHub API directly from Cypher. You will need a GitHub token to authenticate, which we save as a parameter:
:params {"token": "<your_token>"}
The token is not strictly required, but the rate limits for unauthenticated requests are much lower, so it is easier to create one by following the instructions at https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line#creating-a-token.
Calling the GitHub web API
We can use apoc.load.jsonParams to load a JSON file from a web API, setting the HTTP request headers in the second parameter of this procedure:
CALL apoc.load.json("neo4j_neo4j_contributors_github.json") YIELD value AS item
MATCH (u:User {login: item.login})
CALL apoc.load.jsonParams(item.followers_url, {Authorization: 'Token ' + $token}, null) YIELD value AS contrib
MERGE (f:User {login: contrib.login})
CREATE (f)-[:FOLLOWS]->(u)
When performing the import, I got the following results:
Added 439 labels, created 439 nodes, set 439 properties, created 601 relationships, completed after 12652 ms.
This may vary when you run this since a given user's followers evolve over time. Here is the resulting graph, where users are shown in green and repositories in blue:
You can use any of the provided URLs to enrich your graph, depending on the kind of analysis you want to perform: you can add commits, contributors, issues, and so on.
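For instance, since we stored contributors_url as a property on each Repository node, a sketch of such an enrichment could reuse the same jsonParams pattern as above (the relationship type matches the one created earlier):
MATCH (r:Repository)
CALL apoc.load.jsonParams(r.contributors_url, {Authorization: 'Token ' + $token}, null) YIELD value AS contrib
MERGE (u:User {login: contrib.login})
MERGE (u)-[:CONTRIBUTED_TO]->(r)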
Summary of import methods
Choosing the right tool to import your data mainly depends on its format. Here are some overall recommendations:
- If you only have JSON files, then apoc.load.json is your only option.
- If you are using CSV files, then:
- If your data is big, use the import tool from the command line.
- If your data is small or medium-sized, you can use APOC's apoc.load.csv or Cypher's LOAD CSV.
This closes our section about data imports, where we learned how to feed a Neo4j graph with existing data, from CSV, JSON, or even via a direct call to a web API. We will use those tools all through this book to have meaningful data to run the algorithms on.
Before moving on to those algorithms, a final step is needed. As with SQL, there are often several Cypher queries that produce the same result, but not all of them are equally efficient. The next section will show you how to measure query performance and cover some good practices to avoid the main pitfalls.