Simple recommendation engine using Neo4j

Jul 02, 2015

This post builds a simple recommendation engine that recommends repositories based on their contributors and organizations.

Data Source

The public GitHub timeline from GitHub Archive is parsed hourly using a [Node.js streaming parser](https://github.com/harishvc/githubanalytics/blob/master/bin/FetchParseGitHubArchive.js). Currently three event types are captured: PushEvent, CreateEvent, and WatchEvent. PushEvent contains information about commits and authors, CreateEvent contains new repositories, and WatchEvent contains information about popular repositories. All the data is first stored in MongoDB. It is then processed by [Neo4jSync.py](https://github.com/harishvc/githubanalytics/blob/master/bin/Neo4jSync.py), which generates CSV files that are imported into Neo4j (hosted on GrapheneDB).
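The event filtering step can be sketched in a few lines of Python. This is only an illustration, not the Node.js parser linked above: it assumes each hourly archive has already been decompressed into one JSON event per line, that every event carries a top-level `type` field, and the function name `filter_events` and the sample records are made up for the example.

```python
import json

# Event types the post says are captured.
CAPTURED_TYPES = {"PushEvent", "CreateEvent", "WatchEvent"}


def filter_events(lines):
    """Yield parsed events whose "type" is one of the captured types.

    `lines` is any iterable of JSON strings, one event per line
    (assumed layout of a decompressed hourly archive).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") in CAPTURED_TYPES:
            yield event


# Hypothetical sample records, not real timeline data.
sample = [
    '{"type": "PushEvent", "repo": {"name": "octo/demo"}}',
    '{"type": "ForkEvent", "repo": {"name": "octo/demo"}}',
    '{"type": "WatchEvent", "repo": {"name": "octo/other"}}',
]
kept = [event["type"] for event in filter_events(sample)]
print(kept)
```

Writing the filter as a generator keeps memory use flat, which matters when an hourly file holds many thousands of events and only a subset is stored.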

Data Model

Currently there are three types of nodes: Repository, Organization, and People. A Repository node contains information about the repository and when the node was created. An Organization node contains information about the organization a specific repository belongs to and when the node was created. A People node contains information about a contributor (the contributor's email address) and when the node was created. An IN_ORGANIZATION relationship exists between a Repository node and an Organization node. An IS_ACTOR relationship exists between a Repository node and a People node. There can be more than one person contributing to a repository.
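To make the model concrete, here is a minimal Python sketch that turns a few records into the three node sets and the two relationship types described above. It is not taken from Neo4jSync.py. The helper `build_graph`, the input shape, the sample data, and the relationship directions (repository to organization, person to repository) are all assumptions for illustration. It also assumes the part of the repository name before the slash is the organization, and it leaves out the creation timestamps.

```python
from collections import namedtuple

# A directed edge: start node key, relationship type, end node key.
Relationship = namedtuple("Relationship", ["start", "type", "end"])


def build_graph(records):
    """Build node sets and relationships from (repo, emails) records.

    `records` is an iterable of (repo_full_name, contributor_emails),
    where repo_full_name looks like "organization/repository".
    """
    nodes = {"Repository": set(), "Organization": set(), "People": set()}
    rels = set()
    for full_name, emails in records:
        org = full_name.partition("/")[0]  # assumed: owner is the organization
        nodes["Repository"].add(full_name)
        nodes["Organization"].add(org)
        rels.add(Relationship(full_name, "IN_ORGANIZATION", org))
        for email in emails:
            nodes["People"].add(email)
            # One IS_ACTOR edge per contributor, so a repository
            # can have several people attached to it.
            rels.add(Relationship(email, "IS_ACTOR", full_name))
    return nodes, rels


# Hypothetical sample data.
records = [
    ("acme/widgets", ["a@example.com", "b@example.com"]),
    ("acme/gears", ["a@example.com"]),
]
nodes, rels = build_graph(records)
for rel in sorted(rels):
    print(rel.start, rel.type, rel.end)
```

Using sets means a person or organization seen in several records still becomes a single node, which is the property a recommendation query relies on when it walks from one repository to another through a shared contributor or organization.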

Nodes & Relationships model developed using YUML
