Simple recommendation engine using Neo4j

Jul 02, 2015

This post builds a simple recommendation engine that recommends GitHub repositories based on shared contributors and organizations.

Data Source

The public GitHub timeline from GitHub Archive is parsed hourly using a [node.js streaming parser](https://github.com/harishvc/githubanalytics/blob/master/bin/FetchParseGitHubArchive.js). Currently, three event types are captured: PushEvent, CreateEvent and WatchEvent. PushEvent contains information about commits and authors. CreateEvent contains new repositories. WatchEvent contains information about popular repositories. All the data is first stored in MongoDB. The data in MongoDB is then processed using [Neo4jSync.py](https://github.com/harishvc/githubanalytics/blob/master/bin/Neo4jSync.py) to generate CSV files, which are imported into Neo4j (hosted on GrapheneDB).
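The filtering step can be sketched in a few lines of Python. This is only an illustration, not the node.js parser linked above: it assumes each line of input is one JSON-encoded event with a top-level "type" field, and the names CAPTURED_TYPES and filter_events are made up for this example.

```python
import json

# The three event types the post says are currently captured.
CAPTURED_TYPES = {"PushEvent", "CreateEvent", "WatchEvent"}


def filter_events(lines):
    """Yield parsed events whose "type" is one of the captured types.

    Assumption: `lines` is an iterable of strings, one JSON event per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of aborting the run
        if isinstance(event, dict) and event.get("type") in CAPTURED_TYPES:
            yield event


# Hypothetical sample records, trimmed to the fields used here.
sample = [
    '{"type": "PushEvent", "repo": {"name": "edx/edx-platform"}}',
    '{"type": "ForkEvent", "repo": {"name": "edx/edx-platform"}}',
    '{"type": "WatchEvent", "repo": {"name": "edx/edx-platform"}}',
]
kept = [event["type"] for event in filter_events(sample)]
print(kept)
```

Using a generator keeps memory use flat, which matters when an hourly archive holds many thousands of events.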

Data Model

Currently there are three types of nodes: Repository, Organization and People. A Repository node contains information about the repository and when the node was created. An Organization node contains information about the organization a repository belongs to and when the node was created. A People node contains information about a contributor (the contributor's email address) and when the node was created. An IN_ORGANIZATION relationship exists between a Repository node and an Organization node. An IS_ACTOR relationship exists between a Repository node and a People node. More than one person can contribute to a repository.
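To make the model concrete, here is a rough Python sketch that turns contribution records into the three node sets and two relationship sets. It is not taken from Neo4jSync.py. It assumes each record is a (repository full name, contributor email) pair and that the organization is the owner part of "owner/name"; both assumptions and the function name build_graph are illustrative only. Creation timestamps are left out to keep it short.

```python
def build_graph(contributions):
    """Build node and relationship sets from (repository, email) pairs.

    Assumption: the organization is the text before the first "/" in the
    repository's full name.
    """
    repositories, organizations, people = set(), set(), set()
    in_organization, is_actor = set(), set()
    for full_name, email in contributions:
        org = full_name.split("/", 1)[0]
        repositories.add(full_name)
        organizations.add(org)
        people.add(email)
        in_organization.add((full_name, org))  # Repository -- Organization
        is_actor.add((full_name, email))       # Repository -- People
    return {
        "Repository": repositories,
        "Organization": organizations,
        "People": people,
        "IN_ORGANIZATION": in_organization,
        "IS_ACTOR": is_actor,
    }


# Hypothetical records: two people contribute to the same repository.
graph = build_graph([
    ("edx/edx-platform", "a@example.com"),
    ("edx/edx-platform", "b@example.com"),
    ("edx/configuration", "a@example.com"),
])
print(sorted(graph["Organization"]))
print(len(graph["IS_ACTOR"]))
```

Sets deduplicate repeated records, so replaying the same contribution twice does not create duplicate nodes or relationships.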

Figure: nodes and relationships model, developed using yUML.

Simple Recommendation Engine

A Cypher query finds similar repositories based on contributors and organizations. For the repository edx/edx-platform, the query finds all repositories connected to the same People or Organization nodes through an IS_ACTOR or IN_ORGANIZATION relationship. The result is then sorted by number of connections in descending order.
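The idea can be sketched as follows. The Cypher string is a guess at the shape of such a query, not the original one: the property name `id` and the `$id` parameter syntax (older Neo4j versions used `{id}`) are assumptions. The Python function similar_repositories is a hypothetical in-memory equivalent that applies the same counting logic to plain pairs, so the ranking can be tried without a database.

```python
from collections import Counter

# Assumed query shape; the `id` property is a placeholder, not confirmed.
SIMILAR_REPOS_QUERY = """
MATCH (r:Repository {id: $id})-[:IS_ACTOR|IN_ORGANIZATION]-(shared)
      -[:IS_ACTOR|IN_ORGANIZATION]-(other:Repository)
WHERE other <> r
RETURN other.id AS repository, count(shared) AS connections
ORDER BY connections DESC
"""


def similar_repositories(target, links):
    """Rank repositories by how many neighbors they share with `target`.

    `links` is an iterable of (repository, neighbor) pairs, where a neighbor
    is a contributor or an organization.
    """
    neighbors_of = {}
    for repo, neighbor in links:
        neighbors_of.setdefault(repo, set()).add(neighbor)
    target_neighbors = neighbors_of.get(target, set())
    counts = Counter()
    for repo, neighbors in neighbors_of.items():
        if repo == target:
            continue
        shared = len(neighbors & target_neighbors)
        if shared:
            counts[repo] = shared
    # Most connections first; ties broken by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data; neighbors are tagged so emails and org names cannot clash.
links = [
    ("edx/edx-platform", ("person", "a@example.com")),
    ("edx/edx-platform", ("person", "b@example.com")),
    ("edx/edx-platform", ("org", "edx")),
    ("example-fork/edx-platform", ("person", "a@example.com")),
    ("example-fork/edx-platform", ("person", "b@example.com")),
    ("edx/configuration", ("org", "edx")),
    ("other/repo", ("person", "z@example.com")),
]
ranked = similar_repositories("edx/edx-platform", links)
print(ranked)
```

Repositories with no shared contributor or organization are left out entirely, which matches a graph pattern that only returns connected nodes.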

Output

Here, the repositories edx/edx-platform and amir-qayyum-arbisoft/edx-platform share 25 contributors. Isn’t it amazing!

Future Work

Additional types of nodes can be created to improve the recommendations. The cypher query could be further optimized.

Tags

  • nosql
  • github
  • neo4j
Harish Chakravarthy is an intrapreneur leveraging technology to make a positive difference. Interests include API integration, user experience, data visualization and analytics.
