
Knowledge graphs
If you have followed Neo4j news over the last few years, you have probably heard a lot about knowledge graphs. However, it is not always clear what they are. Unfortunately, there is no universal definition of a knowledge graph, but let's try to understand the concepts hidden behind these two words.
Attempting a definition of knowledge graphs
Modern applications produce petabytes of data every day. As an example, during 2019, the number of Google searches performed every minute was estimated at more than 4.4 million. During the same amount of time, about 180 million emails and more than 500,000 tweets were sent, while around 4.5 million videos were watched on YouTube. Organizing this data and transforming it into knowledge is a real challenge.
Knowledge graphs try to address this challenge by storing the following in the same data structure:
- Entities related to a specific field, such as users or products
- Relationships between entities, for instance, user A bought a surfboard
- Context to understand the previous entities and relationships, for instance, user A lives in Hawaii and is a surf teacher
Graphs are a perfect structure for storing all this information, since it is very easy to aggregate data from different data sources: we just have to create new nodes (possibly with new labels) and new relationships. There is no need to update the existing nodes.
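As a minimal illustration of this point (the User, Product, Location, and Job labels and the relationship types are hypothetical, chosen only for this sketch), aggregating a new data source in Cypher boils down to MERGE statements; existing nodes are matched, never modified:
// Entity coming from the first data source (already in the graph)
MERGE (u:User {name: 'A'})
// New entity and relationship coming from a second data source
MERGE (p:Product {name: 'surfboard'})
MERGE (u)-[:BOUGHT]->(p)
// Context added from a third data source, again without touching the existing User node
MERGE (h:Location {name: 'Hawaii'})
MERGE (u)-[:LIVES_IN]->(h)
MERGE (j:Job {name: 'surf teacher'})
MERGE (u)-[:WORKS_AS]->(j)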
Those graphs can be used in many ways. We can, for instance, distinguish the following:
- Business knowledge graph: You can build such a graph to address some specific tasks within your enterprise, such as providing fast and accurate recommendations to your customers.
- Enterprise knowledge graph: To go even beyond the business knowledge graph, you can build a graph whose purpose is to support multiple units in the enterprise.
- Field knowledge graph: This goes further and gathers all information about a specific area such as medicine or sport.
Since 2019, knowledge graphs even have their own conference, organized by Columbia University in New York. You can browse the recordings of past events and learn more about how organizations use knowledge graphs to empower their business at https://www.knowledgegraph.tech/.
In the rest of this section, we will learn how to build a knowledge graph in practice. We will study several ways:
- Structured data: Such data can come from a legacy database such as SQL.
- Unstructured data: This covers textual data that we will analyze using NLP techniques.
- Online knowledge graphs, especially Wikidata (https://www.wikidata.org).
Let's start with the structured data case.
Building a knowledge graph from structured data
Seen from this angle, a knowledge graph is nothing more than a graph database with well-defined relationships between entities.
We have actually already started building a knowledge graph in Chapter 2, Cypher Query Language. Indeed, the graph we built there contains the Neo4j-related repositories and users on GitHub: it is a representation of the knowledge we have regarding the Neo4j ecosystem.
So far, the graph only contains two kinds of information:
- The list of repositories owned by the Neo4j organization
- The list of contributors to each of these repositories
But our knowledge can be extended much beyond this. Using the GitHub API, we can go deeper and, for instance, gather the following:
- The list of repositories owned by each contributor to Neo4j, or the list of repositories they contributed to
- The list of tags assigned to each repository
- The list of users each of these contributors follows
- The list of users following each contributor
For example, let's import each repository contributor and their owned repositories in one single query:
MATCH (u:User)-[:OWNS]->(r:Repository)
CALL apoc.load.jsonParams("https://api.github.com/repos/" + u.login + "/" + r.name + "/contributors", {Authorization: 'Token ' + $token}, null) YIELD value AS item
MERGE (u2:User {login: item.login})
MERGE (u2)-[:CONTRIBUTED_TO]->(r)
WITH item, u2
CALL apoc.load.jsonParams(item.repos_url, {Authorization: 'Token ' + $token}, null) YIELD value AS contrib
MERGE (r2:Repository {name: contrib.name})
MERGE (u2)-[:OWNS]->(r2)
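In the same spirit, the followers of each user (the last item in the preceding list) could be imported with a query like the following. This is only a sketch: the FOLLOWS relationship type is our own choice and is not used in the rest of this chapter.
MATCH (u:User)
CALL apoc.load.jsonParams("https://api.github.com/users/" + u.login + "/followers", {Authorization: 'Token ' + $token}, null) YIELD value AS follower
// Each element of the returned array is yielded as a separate row containing a login field
MERGE (f:User {login: follower.login})
MERGE (f)-[:FOLLOWS]->(u)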
Due to the reduced rate limit on the GitHub API, these queries will fail if you are not using a GitHub token.
The preceding queries use the data from neo4j_repos_github.json that we imported in the previous chapter. Moreover, since the first one sends one request per user per repository, it can take some time to complete (around 5 minutes).
You can keep playing around and extending your knowledge graph of the Neo4j community on GitHub in this way. In the following sections, we will learn how to use NLP to extend this graph further and extract information from each project's README file.
Building a knowledge graph from unstructured data using NLP
NLP is the part of machine learning whose goal is to understand natural language. In other words, the holy grail of NLP is to make computers answer questions such as "What's the weather like today?"
NLP
In NLP, researchers and computer scientists try to make a computer understand an English (or any other human language) sentence. The result of their hard work can be seen in many modern applications, such as the voice assistants Apple Siri or Amazon Alexa.
But before going into such advanced systems, NLP can be used to do the following:
- Perform sentiment analysis: Is a comment about a specific brand positive or negative?
- Named Entity Recognition (NER): Can we extract the names of people or locations contained within a given text, without having to list them all in a regex pattern?
These two questions, quite easy for a human being, are incredibly hard for a machine. The models used to achieve very good results are beyond the scope of this book, but you can refer to the Further reading section to learn more about them.
In the next section, we are going to use pre-trained models provided by the Stanford NLP Group (https://stanfordnlp.github.io/), which achieve state-of-the-art results.
Neo4j tools for NLP
Even though NLP is not officially supported by Neo4j, community members and companies using Neo4j provide some interesting plugins. One of them was developed by GraphAware and enables Neo4j users to run the Stanford NLP tools from within Neo4j. That's the library we will use in this section.
GraphAware NLP library
If you are interested in the implementation and more detailed documentation, the code is available at https://github.com/graphaware/neo4j-nlp.
To install this package, you'll need to visit https://products.graphaware.com/ and download the following JAR files:
- framework-server-community (if using the Neo4j Community Edition) or framework-server-enterprise (if using the Enterprise Edition)
- nlp
- nlp-stanford-nlp
You also need to download trained models from Stanford Core NLP available at https://stanfordnlp.github.io/CoreNLP/#download. In this book, you will only need the models for the English language.
After all those JAR files are downloaded, you need to copy them into the plugins directory of the GitHub graph we started building in Chapter 2, Cypher Query Language. Here is the list of JAR files that you should have downloaded and that will be needed to run the code in this chapter:
- apoc-3.5.0.6.jar
- graphaware-server-community-all-3.5.11.54.jar
- graphaware-nlp-3.5.4.53.16.jar
- nlp-stanfordnlp-3.5.4.53.17.jar
- stanford-english-corenlp-2018-10-05-models.jar
Once those JAR files are in your plugins directory, you have to restart the graph. To check that everything is working fine, you can check that GraphAware NLP procedures are available with the following query:
CALL dbms.procedures() YIELD name, signature, description, mode
WHERE name =~ 'ga.nlp.*'
RETURN signature, description, mode
ORDER BY name
You should see several rows listing the ga.nlp procedures with their signatures.
The last step before starting to use the NLP library is to update some settings in neo4j.conf. First, mark the ga.nlp.* procedures as unrestricted and tell Neo4j where to look for the plugin:
dbms.security.procedures.unrestricted=apoc.*,ga.nlp.*
dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
Then, add the following two lines, specific to the GraphAware plugin, in the same neo4j.conf file:
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
After restarting the graph, your working environment is ready. Let's import some textual data to run the NLP algorithms on.
Importing test data from the GitHub API
As test data, we will use the content of the README for each repository in our graph, and see what kind of information can be extracted from it.
The API to get the README from a repository is the following:
GET /repos/<owner>/<repo>/readme
Similarly to what we have done in the previous chapter, we are going to use apoc.load.jsonParams to load this data into Neo4j. First, we set our GitHub access token, if any (optional):
:params {"token": "8de08ffe137afb214b86af9bcac96d2a59d55d56"}
Then we can run the following query to retrieve the README of all repositories in our graph:
MATCH (u:User)-[:OWNS]->(r:Repository)
CALL apoc.load.jsonParams("https://api.github.com/repos/" + u.login + "/" + r.name + "/readme", {Authorization: "Token " + $token}, null, null, {failOnError: false}) YIELD value
CREATE (d:Document {name: value.name, content:value.content, encoding: value.encoding})
CREATE (d)-[:DESCRIBES]->(r)
Similarly to the previous query to fetch data from the GitHub API, the execution time of this query can be quite long (up to more than 15 minutes).
You will notice from the preceding query that we added a parameter {failOnError: false} to prevent APOC from raising an exception when the API returns a status code different from 200. This is the case for the https://github.com/neo4j/license-maven-plugin repository, which does not have any README file.
Checking the content of our new Document nodes, you will realize that the content is base64-encoded. In order to use the NLP tools, we will have to decode it. Fortunately, APOC provides a function for that. We just need to clean our data by removing the line breaks from the downloaded content and then invoke apoc.text.base64Decode as follows:
MATCH (d:Document)
SET d.text = apoc.text.base64Decode(apoc.text.join(split(d.content, "\n"), ""))
RETURN d
If you are not using the default dbms.security.procedures.whitelist parameter in neo4j.conf, you will need to whitelist the apoc.text.* functions for the previous query to work:
dbms.security.procedures.whitelist=apoc.text.*
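To quickly check that the decoding worked, you can inspect the beginning of each decoded text with a simple query like the following (just a sketch for inspection purposes):
MATCH (d:Document)
RETURN d.name, left(d.text, 100) AS preview
LIMIT 5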
Our document nodes now have a human-readable text property, containing the content of the README. Let's now see how to use NLP to learn more about our repositories.
Enriching the graph with NLP
In order to use GraphAware tools, the first step is to build an NLP pipeline:
CALL ga.nlp.processor.addPipeline({
name:"named_entity_extraction",
textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor',
processingSteps: {tokenize:true, ner:true}
})
Here, we specify the following:
- The pipeline name, named_entity_extraction.
- The text processor to be used. GraphAware supports both Stanford NLP and OpenNLP; here, we are using Stanford models.
- The processing steps:
- Tokenization: Extract tokens from a text. As a first approximation, a token can be seen as a word.
- NER: This is the key step that will identify named entities such as persons or locations.
We can now run this pipeline on the README text by calling the ga.nlp.annotate procedure as follows:
MATCH (n:Document)
CALL ga.nlp.annotate({text: n.text, id: id(n), checkLanguage: false, pipeline : "named_entity_extraction"}) YIELD result
MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)
This procedure will actually update the graph and add nodes and relationships to it. The resulting graph schema is displayed here, with only some chosen nodes and relationships to make it more readable:
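Restricted to the nodes and relationships used in the next queries, this schema can be summarized as the following chain (a textual approximation; the library also creates properties and additional labels that are not shown here):
(:Document)-[:HAS_ANNOTATED_TEXT]->(:AnnotatedText)
    -[:CONTAINS_SENTENCE]->(:Sentence)
    -[:HAS_TAG]->(:Tag)
Tags recognized as named entities receive extra labels such as NER_Person, which is how we will match them in the following queries.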
We can now check which people were identified within our repositories:
MATCH (n:NER_Person) RETURN n.value
Part of the result of this query is displayed here:
╒════════════════╕
│"n.value" │
╞════════════════╡
│"Keanu Reeves" │
├────────────────┤
│"Arthur" │
├────────────────┤
│"Bob" │
├────────────────┤
│"James" │
├────────────────┤
│"Travis CI" │
├────────────────┤
│"Errorf" │
└────────────────┘
You can see that, despite some errors, such as Errorf or Travis CI being identified as people, the NER model was able to successfully identify Keanu Reeves, along with other, less well-known contributors.
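Persons are not the only entity type extracted. You can get an overview of all recognized entity types and their counts with a query like the following (a sketch that relies on the NER_* labels added by the library):
MATCH (t:Tag)
// Keep only tags carrying at least one NER_* label
WITH t, [l IN labels(t) WHERE l STARTS WITH 'NER_'] AS ner_labels
WHERE size(ner_labels) > 0
RETURN ner_labels[0] AS entity_type, count(t) AS occurrences
ORDER BY occurrences DESC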
We can also identify which repository Keanu Reeves was identified in. According to the preceding graph schema, the query we have to write is the following:
MATCH (r:Repository)<-[:DESCRIBES]-(:Document)-[:HAS_ANNOTATED_TEXT]->(:AnnotatedText)-[:CONTAINS_SENTENCE]->(:Sentence)-[:HAS_TAG]->(:NER_Person {value: 'Keanu Reeves'})
RETURN r.name
This query returns only one result: neo4j-ogm. This actor's name is actually used within that repository's README, at least in the version I downloaded (you may get different results here, since READMEs change over time).
NLP is a fantastic tool to extend knowledge graphs and bring structure from unstructured textual data. But there is another source of information that we can also use to enhance a knowledge graph. Indeed, some organizations such as the Wikimedia foundation give access to their own knowledge graph. We will learn in the next section how to use the Wikidata knowledge graph to add even more context to our data.
Adding context to a knowledge graph from Wikidata
Wikidata defines itself with the following words:
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
In practice, a Wikidata page, like the one regarding Neo4j (https://www.wikidata.org/wiki/Q1628290) contains a list of properties such as programming language or official website.
Introducing RDF and SPARQL
The Wikidata structure actually follows the Resource Description Framework (RDF). Part of the W3C specifications since 1999, this format stores data as triples:
(subject, predicate, object)
For instance, the sentence Homer is the father of Bart is translated with RDF format as follows:
(Homer, is father of, Bart)
This RDF triple can be written with a syntax closer to Cypher:
(Homer)-[:IS_FATHER_OF]->(Bart)
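To make the parallel concrete, the same fact could be stored in Neo4j with a query like this one (a minimal sketch using a hypothetical Person label):
MERGE (homer:Person {name: 'Homer'})
MERGE (bart:Person {name: 'Bart'})
MERGE (homer)-[:IS_FATHER_OF]->(bart)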
RDF data can be queried using the SPARQL query language, also standardized by the W3C.
The following section will teach you how to build simple queries against Wikidata.
Querying Wikidata
All the queries we are going to write here can be tested using the online Wikidata tool at https://query.wikidata.org/.
If you have done the assessments at the end of Chapter 2, Cypher Query Language, your GitHub graph should have nodes with the Location label, containing the city each user declares they live in. If you skipped Chapter 2 or the assessment, you can find this graph in the GitHub repository for this chapter. The current graph schema is the following:
Our goal will be to assign a country to each of these locations. Let's start with the most frequent location among Neo4j contributors, Malmö. This is a city in Sweden where the company building and maintaining Neo4j, Neo4j, Inc., has its main offices.
How can we find the country in which Malmö is located using Wikidata? We first need to find the page regarding Malmö on Wikidata. A simple search on your favorite search engine should lead you to https://www.wikidata.org/wiki/Q2211. From there, two pieces of information are important to note:
- The entity identifier in the URL: Q2211. For Wikidata, Q2211 means Malmö.
- If you scroll down on the page, you will find the property, country, which links to a Property page for property P17: https://www.wikidata.org/wiki/Property:P17.
With these two pieces of information, we can build and test our first SPARQL query:
SELECT ?country
WHERE {
    wd:Q2211 wdt:P17 ?country .
}
Notice the dot at the end of the line inside the WHERE block. It is very important in SPARQL and marks the end of a triple statement.
This query, in Cypher words, would read as follows: starting from the entity whose identifier is Q2211 (Malmö), follow the relationship of type P17 (country), and return the entity at the end of this relationship. To push the comparison to Cypher further, the preceding SPARQL query could be written as the following Cypher-like pseudocode:
MATCH (n {id: wd:Q2211})-[r {id: wdt:P17}]->(country)
RETURN country
So, if you run the preceding SPARQL query in the Wikidata online shell, you will get a result like wd:Q34, with a link to the Sweden page in Wikidata. Great, it works! However, if we want to automate this process, having to click on a link to get the country name is not very convenient. Fortunately, we can get this information directly from SPARQL. The main difference compared to the previous query is that we have to specify the language in which we want the result back. Here, I forced the language to be English:
SELECT ?country ?countryLabel
WHERE {
wd:Q2211 wdt:P17 ?country .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Executing this query, you now also get the country name, Sweden, as a second column of the result.
Let's go even further. To get the city identifier, Q2211, we had to first search Wikidata and manually introduce it in the query. Can't SPARQL perform this search for us? The answer, as expected, is yes, it can:
SELECT ?city ?cityLabel ?countryLabel WHERE {
?city rdfs:label "Malmö"@en .
?city wdt:P17 ?country .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Instead of starting from a well-known entity, we start by performing a search within Wikidata to find the entities whose label, in English, is Malmö.
However, you'll notice that running this query now returns three rows, all having Malmö as city label, but two of them are in Sweden and the last one is in Norway. If we want to select only the Malmö we are interested in, we will have to narrow down our query and add more criteria. For instance, we can select only big cities:
SELECT ?city ?cityLabel ?countryLabel WHERE {
?city rdfs:label "Malmö"@en;
wdt:P31 wd:Q1549591 .
?city wdt:P17 ?country .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
In this query, we see the following:
- P31 means instance of.
- Q1549591 is the identifier for big city.
So the city-selection part of the preceding query, translated into English, could be read as follows:
Cities
whose label in English is "Malmö"
AND that are instances of "big city"
Now we select only one Malmö, the one in Sweden, which is the Q2211 entity we identified at the beginning of this section.
Next, let's see how to use this query result to extend our Neo4j knowledge graph.
Importing Wikidata into Neo4j
In order to automatize data import into Neo4j, we will use the Wikidata query API:
GET https://query.wikidata.org/sparql?format=json&query={SPARQL}
Using the format=json is not mandatory but it will force the API to return a JSON result instead of the default XML; it is a matter of personal preference. In that way, we will also be able to use the apoc.load.json procedure to parse the result and create Neo4j nodes and relationships depending on our needs. Note that if you are used to XML and prefer to manipulate this data format, APOC also has a procedure to import XML into Neo4j: apoc.load.xml.
The second parameter of the Wikidata API endpoint is the SPARQL query itself, such as the ones we have written in the previous section. We can run the query to ask for the country and country label of Malmö (entity Q2211):
https://query.wikidata.org/sparql?format=json&query=SELECT ?country ?countryLabel WHERE {wd:Q2211 wdt:P17 ?country . SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }}
The resulting JSON that you can directly see in your browser is the following:
{
  "head": {
    "vars": [
      "country",
      "countryLabel"
    ]
  },
  "results": {
    "bindings": [
      {
        "country": {
          "type": "uri",
          "value": "http://www.wikidata.org/entity/Q34"
        },
        "countryLabel": {
          "xml:lang": "en",
          "type": "literal",
          "value": "Sweden"
        }
      }
    ]
  }
}
If we want to handle this data with Neo4j, we can copy the result into the wikidata_malmo_country_result.json file (or download this file from the GitHub repository of this book), and use apoc.load.json to access the country name:
CALL apoc.load.json("wikidata_malmo_country_result.json") YIELD value as item
RETURN item.results.bindings[0].countryLabel.value
Remember to put the file to be imported inside the import folder of your active graph.
But, if you remember from Chapter 2, Cypher Query Language, APOC also has the ability to perform API calls by itself. It means that the two steps we've just followed – querying Wikidata and saving the result in a file, and importing this data into Neo4j – can be merged into a single step in the following way:
WITH 'SELECT ?countryLabel WHERE {wd:Q2211 wdt:P17 ?country. SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }}' as query
CALL apoc.load.jsonParams('http://query.wikidata.org/sparql?format=json&query=' + apoc.text.urlencode(query), {}, null) YIELD value as item
RETURN item.results.bindings[0].countryLabel.value
Using a WITH clause here is not mandatory. But if we want to run the preceding query for all Location nodes, it is convenient to use such a syntax:
MATCH (l:Location) WHERE l.name <> ""
WITH l, 'SELECT ?countryLabel WHERE { ?city rdfs:label "' + l.name + '"@en. ?city wdt:P17 ?country. SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }' as query
CALL apoc.load.jsonParams('http://query.wikidata.org/sparql?format=json&query=' + apoc.text.urlencode(query), {}, null) YIELD value as item
RETURN l.name, item.results.bindings[0].countryLabel.value as country_name
This returns a result like the following:
╒═════════════╤═════════════════════════════╕
│"l.name" │"country_name" │
╞═════════════╪═════════════════════════════╡
│"Dresden" │"Germany" │
├─────────────┼─────────────────────────────┤
│"Beijing" │"People's Republic of China" │
├─────────────┼─────────────────────────────┤
│"Seoul" │"South Korea" │
├─────────────┼─────────────────────────────┤
│"Paris" │"France" │
├─────────────┼─────────────────────────────┤
│"Malmö" │"Sweden" │
├─────────────┼─────────────────────────────┤
│"Lund" │"Sweden" │
├─────────────┼─────────────────────────────┤
│"Copenhagen" │"Denmark" │
├─────────────┼─────────────────────────────┤
│"London" │"United Kingdom" │
├─────────────┼─────────────────────────────┤
│"Madrid" │"Spain" │
└─────────────┴─────────────────────────────┘
This result can then be used to create new country nodes with a relationship between the city and the identified country in this way:
MATCH (l:Location) WHERE l.name <> ""
WITH l, 'SELECT ?countryLabel WHERE { ?city rdfs:label "' + l.name + '"@en. ?city wdt:P17 ?country. SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }' as query
CALL apoc.load.jsonParams('http://query.wikidata.org/sparql?format=json&query=' + apoc.text.urlencode(query), {}, null) YIELD value as item
WITH l, item.results.bindings[0].countryLabel.value as country_name
MERGE (c:Country {name: country_name})
MERGE (l)-[:LOCATED_IN]->(c)
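To check the newly created nodes and relationships, a simple aggregation query like the following (a quick sketch) counts the locations attached to each country:
MATCH (c:Country)<-[:LOCATED_IN]-(l:Location)
RETURN c.name AS country, count(l) AS locations
ORDER BY locations DESC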
Our knowledge graph of the Neo4j community on GitHub has been extended thanks to the free online Wikidata resources.
Note that if you have to manage large RDF datasets, the neosemantics extension of Neo4j is the way to go instead of APOC:
https://github.com/neo4j-labs/neosemantics
The method we used to extract the city from the user-defined location on GitHub relies on broad approximations, and the results are often inaccurate. We used it for teaching purposes, but in a real-life scenario, we would rather use some kind of geocoding service, such as the ones provided by Google or OpenStreetMap, to get a normalized location from free-text user input.
If you navigate through Wikidata, you will see there are many other possibilities for extensions. It does not only contain information about persons and locations but also about some common words. As an example, you can search for rake, and you will see that it is classified as an agricultural tool used by farmers and gardeners that can be made out of plastic or steel or wood. The amount of information stored there, in a structured way, is incredible. But there are even more ways to extend a knowledge graph. We are going to take advantage of another source of data: semantic graphs.
Enhancing a knowledge graph from semantic graphs
If you were curious enough to read the documentation of the GraphAware NLP package, you may have already seen the procedure we are going to use now: the enrich procedure.
This procedure uses the ConceptNet graph, which relates words together with different kinds of relationships. We can find synonyms and antonyms but also created by or symbol of relationships. The full list is available at https://github.com/commonsense/conceptnet5/wiki/Relations.
Let's see ConceptNet in action. For this, we first need to select a Tag node created by the GraphAware annotate procedure we used previously. For this example, I will use the Tag corresponding to the verb "make" and look for its synonyms. The syntax is the following:
MATCH (t:Tag {value: "make"})
CALL ga.nlp.enrich.concept({tag: t, depth: 1, admittedRelationships: ["Synonym"]})
YIELD result
RETURN result
The admittedRelationships parameter is a list of relationship types as defined in ConceptNet (check the preceding link). The procedure creates new tags, plus relationships of type IS_RELATED_TO between the new tags and the original one, "make". We can visualize the result easily with this query:
MATCH (t:Tag {value: "make"})-[:IS_RELATED_TO]->(n)
RETURN t, n
The result is shown in the following diagram. You can see that ConceptNet knows that produce, construct, create, cause, and many other verbs are synonyms of make:
This information is very useful, especially when trying to build a system to understand the user intent. That's the first use case for knowledge graphs we are going to investigate in the next section: graph-based search.