Hands-On Graph Analytics with Neo4j

Recommendation engine

Recommendations are now unavoidable if you work for an e-commerce website. But e-commerce is not the only use case for recommendations. You can also receive recommendations for people you may want to follow on Twitter, meetups you may want to attend, or repositories you might like to know about. Knowledge graphs are a good way to generate those recommendations.

In this section, we are going to use our GitHub graph to recommend to users new repositories they are likely to contribute to or follow. We will explore several possibilities, split into two cases: either your graph contains some social information (users can like or follow each other) or it doesn't. We'll start from the case where you do not have access to any social data since it is the most common one.

Product similarity recommendations

Recommending products, whether we are talking about movies, gardening tools, or meetups, follows some common patterns. Here are some common-sense assertions that can lead to a good recommendation:

  • Products in the same category as a product already bought are more likely to be useful to the user. For instance, if you buy a rake, it probably means you like gardening, so a lawnmower could be of interest to you.
  • There are some products that often get bought together, for instance, printers, ink, and paper. If you buy a printer, it is natural to recommend the ink and paper other users also bought.

We are going to see the implementations of those two approaches using Cypher. We will again use the GitHub graph as a playground. The important parts of its structure are shown in the next schema:

It contains the following entities:

  • Node labels: User, Repository, Language, and Document
  • Relationships:
    • A User node owns or contributes to one or several Repository nodes.
    • A Repository node has one or several Language nodes.
    • A User node can follow another User node.

Thanks to the GitHub API, the USES_LANGUAGE relationship even holds a property quantifying the number of bytes of code using that language.

Products in the same category

In the GitHub graph, we will consider the language as categorizing the repositories. All repositories using Scala will be in the same category. For a given user, we can get the languages used by the repositories they contributed to with the following:

MATCH (:User {login: "boggle"})-[:CONTRIBUTED_TO]->(repo:Repository)-[:USES_LANGUAGE]->(lang:Language)
RETURN lang

If we want to find the other repositories using the same language, we can extend the path from the language node to the other repositories in this way:

MATCH (u:User {login: "boggle"})-[:CONTRIBUTED_TO]->(repo:Repository)-[:USES_LANGUAGE]->(lang:Language)<-[:USES_LANGUAGE]-(recommendation:Repository)
WHERE NOT EXISTS ((u)-[:CONTRIBUTED_TO]->(recommendation))
RETURN recommendation
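To make the logic of this query easier to follow, here is a minimal Python sketch that mimics it on a toy in-memory graph. It is only an illustration, not the way Neo4j evaluates the query. The dictionaries and the function name recommend_same_language are assumptions made for this example, and the repository hypothetical-python-repo is invented to show a repository being filtered out.

```python
# Toy stand-in for the graph: plain dictionaries instead of Neo4j.
# boggle, neo4j, neotrients and JUnitSlowTestDiscovery come from the text;
# "hypothetical-python-repo" is invented for illustration only.
CONTRIBUTED_TO = {
    "boggle": {"neo4j"},
}
USES_LANGUAGE = {
    "neo4j": {"Scala"},
    "neotrients": {"Scala"},
    "JUnitSlowTestDiscovery": {"Scala"},
    "hypothetical-python-repo": {"Python"},
}


def recommend_same_language(user):
    """Return repositories sharing a language with the user's repositories,
    excluding the repositories the user already contributed to."""
    own_repos = CONTRIBUTED_TO.get(user, set())
    # Collect every language used by the user's own repositories.
    known_languages = set()
    for repo in own_repos:
        known_languages |= USES_LANGUAGE.get(repo, set())
    # Keep repositories with at least one language in common.
    return {
        repo
        for repo, languages in USES_LANGUAGE.items()
        if languages & known_languages and repo not in own_repos
    }


print(sorted(recommend_same_language("boggle")))
# prints ['JUnitSlowTestDiscovery', 'neotrients']
```

The set intersection plays the role of the shared Language node in the Cypher pattern, and the `repo not in own_repos` test corresponds to the WHERE NOT EXISTS filter.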

For instance, the user boggle contributed to the neo4j repository, which is partly written using Scala. With that technique, we would recommend to this user the repositories neotrients or JUnitSlowTestDiscovery, also using Scala:

However, recommending all repositories using Scala is like recommending all gardening tools because a user bought a rake. It may not be accurate enough, especially when the categories contain lots of items. Let's see which other methods can be used to improve this technique.

Products frequently bought together

One possible solution is to trust your users. Information about their behavior is also valuable.

Consider the pattern in the following diagram:

The user boggle contributed to the repository neo4j. Three more users contributed to it, and also contributed to the repositories parents and neo4j.github.com. Maybe boggle would be interested in contributing to one of those repositories:

MATCH (user:User {login: "boggle"})-[:CONTRIBUTED_TO]->(common_repository:Repository)<-[:CONTRIBUTED_TO]-(other_user:User)-[:CONTRIBUTED_TO]->(recommendation:Repository)
WHERE user <> other_user
RETURN recommendation
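The same traversal can be sketched in Python on a toy dataset. This is a rough illustration under stated assumptions: the logins user_a, user_b, and user_c are hypothetical placeholders for the three other contributors, and the exact split of their contributions is assumed. Unlike the Cypher query above, this sketch also removes repositories the user already contributed to.

```python
# Toy contribution data. boggle, neo4j, parents and neo4j.github.com come
# from the text; user_a, user_b and user_c are hypothetical logins.
CONTRIBUTED_TO = {
    "boggle": {"neo4j"},
    "user_a": {"neo4j", "neo4j.github.com"},
    "user_b": {"neo4j", "neo4j.github.com"},
    "user_c": {"neo4j", "parents"},
}


def recommend_from_co_contributors(user):
    """Return repositories that co-contributors of `user` also worked on."""
    own_repos = CONTRIBUTED_TO.get(user, set())
    recommendations = set()
    for other_user, other_repos in CONTRIBUTED_TO.items():
        if other_user == user:
            continue  # same role as WHERE user <> other_user
        if own_repos & other_repos:  # at least one common repository
            # Extra step not in the Cypher query: drop known repositories.
            recommendations |= other_repos - own_repos
    return recommendations


print(sorted(recommend_from_co_contributors("boggle")))
# prints ['neo4j.github.com', 'parents']
```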

We can even group together this method and the preceding one, by selecting only repositories using a language the user knows and with at least one common contributor:

MATCH (user:User {login: "boggle"})-[:CONTRIBUTED_TO]->(common_repository:Repository)<-[:CONTRIBUTED_TO]-(other_user:User)-[:CONTRIBUTED_TO]->(recommendation:Repository)
MATCH (common_repository)-[:USES_LANGUAGE]->(:Language)<-[:USES_LANGUAGE]-(recommendation)
WHERE user <> other_user
RETURN recommendation
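As a hedged illustration of how the two conditions combine, the following Python sketch applies both filters to a fully hypothetical dataset. All user names, repository names, and languages below are invented for this example. A candidate is kept only if it comes from a co-contributor and shares a language with the common repository.

```python
# Entirely hypothetical data, invented for this illustration.
CONTRIBUTED_TO = {
    "alice": {"repo_scala_1"},
    "bob": {"repo_scala_1", "repo_scala_2", "repo_python_1"},
}
USES_LANGUAGE = {
    "repo_scala_1": {"Scala"},
    "repo_scala_2": {"Scala", "Java"},
    "repo_python_1": {"Python"},
}


def recommend_combined(user):
    """Recommend repositories reached through a common contributor that
    also share a language with the common repository."""
    own_repos = CONTRIBUTED_TO.get(user, set())
    recommendations = set()
    for other_user, other_repos in CONTRIBUTED_TO.items():
        if other_user == user:
            continue
        for common_repo in own_repos & other_repos:
            common_languages = USES_LANGUAGE.get(common_repo, set())
            for candidate in other_repos - own_repos:
                # Second MATCH: the candidate must share a language.
                if USES_LANGUAGE.get(candidate, set()) & common_languages:
                    recommendations.add(candidate)
    return recommendations


print(recommend_combined("alice"))
# prints {'repo_scala_2'}
```

Here repo_python_1 is discarded even though bob contributed to it, because it shares no language with repo_scala_1.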

With only a few matches, we can afford to display all returned items. But as your database grows, you will find many possible recommendations. In that case, finding a way to rank the recommended items becomes essential.

Recommendation ordering

If you look again at the preceding image, you can see that the repository neo4j.github.com is shared between two people, while the parents repository would be recommended by only one person. This information can be used to rank the recommendations. The corresponding Cypher query would be as follows:

MATCH (user:User {login: "boggle"})-[:CONTRIBUTED_TO]->(common_repository:Repository)<-[:CONTRIBUTED_TO]-(other_user:User)-[:CONTRIBUTED_TO]->(recommendation:Repository)
WHERE user <> other_user
WITH recommendation, COUNT(DISTINCT other_user) as reco_importance
RETURN recommendation
ORDER BY reco_importance DESC
LIMIT 5

The new WITH clause is introduced to perform the aggregation: for each possible recommended repository, we count how many users would recommend it.

This is a first way of using user data to provide accurate recommendations. Another way is, when possible, to take social relationships into account, as we will see now.
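The aggregation can be pictured as a vote count. The Python sketch below is an illustration only: it uses collections.Counter to count, for each candidate repository, how many co-contributors point to it, then keeps the top results. The logins user_a, user_b, and user_c are hypothetical. The contribution data is assumed so that it matches the counts described above (two users for neo4j.github.com, one for parents). Each co-contributor is counted once per repository.

```python
from collections import Counter

# boggle, neo4j, parents and neo4j.github.com come from the text;
# user_a, user_b and user_c are hypothetical logins.
CONTRIBUTED_TO = {
    "boggle": {"neo4j"},
    "user_a": {"neo4j", "neo4j.github.com"},
    "user_b": {"neo4j", "neo4j.github.com"},
    "user_c": {"neo4j", "parents"},
}


def rank_recommendations(user, limit=5):
    """Return (repository, number of recommending users) pairs,
    most recommended first, truncated to `limit` entries."""
    own_repos = CONTRIBUTED_TO.get(user, set())
    votes = Counter()
    for other_user, other_repos in CONTRIBUTED_TO.items():
        # Skip the user themselves and users with no repository in common.
        if other_user == user or not (own_repos & other_repos):
            continue
        for repo in other_repos - own_repos:
            votes[repo] += 1  # one vote per co-contributor
    # most_common sorts by count, descending (ORDER BY ... DESC LIMIT n).
    return votes.most_common(limit)


print(rank_recommendations("boggle"))
# prints [('neo4j.github.com', 2), ('parents', 1)]
```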

Social recommendations

If your knowledge graph contains data related to social links between users, as GitHub's or Medium's does, a brand new field of recommendations is open to you. Because you know which people a given user likes or follows, you can have a better idea of which type of content this user is likely to appreciate. For instance, if someone you follow on Medium claps for a story, it is much more likely you will also like it, compared to any other random story you can find on Medium.

Luckily, we have some social data in our GitHub knowledge graph, through the FOLLOWS relationships. So we will use this information to provide other recommendations to our users.

Products bought by a friend of mine

If we want to recommend new repositories to our GitHub users, we can think of the following rule: repositories of a user I follow are more likely to be of interest to me, otherwise I wouldn't follow those users. We can use Cypher to identify those repositories:

MATCH (u:User {login: "mkhq"})-[:FOLLOWS]->(following:User)-[:CONTRIBUTED_TO]->(recommendation:Repository)
WHERE NOT EXISTS ((u)-[:CONTRIBUTED_TO]->(recommendation))
RETURN DISTINCT recommendation

This query matches patterns similar to the following one:

We can also use recommendation ordering here. The higher the number of people I follow that also contributed to a given repository, the higher the probability that I will also contribute to it. This translates into Cypher in the following way:

MATCH (u:User {login: "mkhq"})-[:FOLLOWS]->(following:User)-[:CONTRIBUTED_TO]->(recommendation:Repository)
WHERE NOT EXISTS ((u)-[:CONTRIBUTED_TO]->(recommendation))
WITH u, recommendation, COUNT(following) as nb_following_contributed_to_repo
RETURN recommendation
ORDER BY nb_following_contributed_to_repo DESC
LIMIT 5

The first part of the query is exactly the same as the previous one, while the second part is similar to the query we wrote in the previous section: for each possible recommendation, we count how many of the users mkhq is following would recommend it.
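The ranked social recommendation can also be sketched in Python on toy data. This is a minimal sketch under stated assumptions: only the login mkhq comes from the text, while the followed users and all repository names are hypothetical placeholders. The FOLLOWS and CONTRIBUTED_TO dictionaries are assumed stand-ins for the relationships of the same name.

```python
from collections import Counter

# Only "mkhq" comes from the text; everything else is hypothetical.
FOLLOWS = {
    "mkhq": {"followed_1", "followed_2", "followed_3"},
}
CONTRIBUTED_TO = {
    "mkhq": {"repo_x"},
    "followed_1": {"repo_x", "repo_y"},
    "followed_2": {"repo_y", "repo_z"},
    "followed_3": {"repo_y"},
}


def rank_social_recommendations(user, limit=5):
    """Rank repositories by how many followed users contributed to them,
    skipping repositories the user already contributed to."""
    own_repos = CONTRIBUTED_TO.get(user, set())
    votes = Counter()
    for followed in FOLLOWS.get(user, set()):
        # The set difference mirrors the WHERE NOT EXISTS filter.
        for repo in CONTRIBUTED_TO.get(followed, set()) - own_repos:
            votes[repo] += 1
    return votes.most_common(limit)


print(rank_social_recommendations("mkhq"))
# prints [('repo_y', 3), ('repo_z', 1)]
```

In this toy data, repo_x never appears in the result because mkhq already contributed to it, and repo_y ranks first because all three followed users contributed to it.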

We have seen several ways of finding recommendations based on pure Cypher. They can be extended depending on your data: the more information you have about your products and customers, the more precise the recommendations can be. In the following chapters, we will discover algorithms to create clusters of nodes within the same community. This concept of community can also be used in the context of recommendations, assuming users within the same community are more likely to like or buy the same products. More details will be given in Chapter 7, Community Detection and Similarity Measures.