Simple recommendation engine using Neo4j
This post builds a simple recommendation engine that recommends GitHub repositories based on shared contributors and organizations.
Data Source
The public GitHub timeline from GitHub Archive is parsed hourly using a [node.js streaming parser](https://github.com/harishvc/githubanalytics/blob/master/bin/FetchParseGitHubArchive.js). Currently the event types PushEvent, CreateEvent, and WatchEvent are captured. PushEvent contains information about commits and authors. CreateEvent contains information about new repositories. WatchEvent contains information about popular repositories. All the data is first stored in MongoDB. The data is then processed using [Neo4jSync.py](https://github.com/harishvc/githubanalytics/blob/master/bin/Neo4jSync.py) to generate CSV files, which are imported into Neo4j (hosted on GrapheneDB).
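The MongoDB-to-CSV step can be sketched roughly as follows. This is not the actual Neo4jSync.py: the event field names (`type`, `repo`, `org`, `authors`), the CSV column names, and the sample events are all assumptions made for illustration, and the events live in an in-memory list rather than a MongoDB collection.

```python
import csv
import io

# Hypothetical, simplified event documents; the real MongoDB schema may differ.
SAMPLE_EVENTS = [
    {"type": "PushEvent", "repo": "edx/edx-platform", "org": "edx",
     "authors": ["a@example.com", "b@example.com"]},
    {"type": "PushEvent", "repo": "edx/edx-platform", "org": "edx",
     "authors": ["a@example.com"]},
    {"type": "WatchEvent", "repo": "edx/edx-platform", "org": "edx"},
]


def events_to_rows(events):
    """Collect de-duplicated (repo, org) and (repo, author) pairs from events."""
    in_org = set()
    is_actor = set()
    for event in events:
        repo = event["repo"]
        if event.get("org"):
            in_org.add((repo, event["org"]))
        # Only PushEvent carries commit authors in this sketch.
        if event["type"] == "PushEvent":
            for email in event.get("authors", []):
                is_actor.add((repo, email))
    return sorted(in_org), sorted(is_actor)


def rows_to_csv(header, rows):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


in_org_rows, is_actor_rows = events_to_rows(SAMPLE_EVENTS)
in_org_csv = rows_to_csv(["repository", "organization"], in_org_rows)
is_actor_csv = rows_to_csv(["repository", "email"], is_actor_rows)
```

De-duplicating the pairs before writing keeps each CSV to one row per relationship, so repeated pushes by the same author do not produce duplicate rows at import time.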
Data Model
Currently there are three types of nodes: Repository, Organization, and People. A Repository node contains information about the repository and when the node was created. An Organization node contains information about the organization a specific repository belongs to and when the node was created. A People node contains information about a contributor (the contributor's email address) and when the node was created. An IN_ORGANIZATION relationship exists between a Repository node and an Organization node. An IS_ACTOR relationship exists between a Repository node and a People node. More than one person can contribute to a repository.
Nodes & Relationships model developed using YUML
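To make the model concrete, here is a minimal in-memory sketch of the three node types and two relationship types in Python. The class names, field names (`label`, `key`, `created`), and sample values are hypothetical. The post does not state the direction of the relationships, so the sketch only records which repository each relationship involves.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Node:
    label: str  # "Repository", "Organization", or "People"
    key: str    # repository name, organization name, or contributor email
    # Creation time is stored on every node but ignored when comparing nodes.
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc), compare=False
    )


@dataclass(frozen=True)
class Relationship:
    kind: str         # "IN_ORGANIZATION" or "IS_ACTOR"
    repository: Node  # always a Repository node
    other: Node       # an Organization node or a People node


repo = Node("Repository", "edx/edx-platform")
org = Node("Organization", "edx")
alice = Node("People", "alice@example.com")
bob = Node("People", "bob@example.com")

relationships = [
    Relationship("IN_ORGANIZATION", repo, org),
    Relationship("IS_ACTOR", repo, alice),
    Relationship("IS_ACTOR", repo, bob),
]

# A repository can have more than one contributor.
contributors = sorted(
    r.other.key
    for r in relationships
    if r.kind == "IS_ACTOR" and r.repository == repo
)
```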
Simple Recommendation Engine
A Cypher query finds similar repositories based on contributors and organizations. For the repository edx/edx-platform, the Cypher query finds all repositories that share an IS_ACTOR or IN_ORGANIZATION relationship with it, that is, repositories with a common contributor or organization. The result is then sorted by the number of connections in descending order.
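The ranking logic described above can be illustrated with a small in-memory Python analogue. This is not the post's Cypher query, only a sketch of the same idea: count how many contributors or organizations each other repository shares with the target, then sort by that count. The edge list and all repository names other than edx/edx-platform are made up for the example.

```python
from collections import defaultdict

# Hypothetical edges: (repository, relationship type, contributor email or organization).
EDGES = [
    ("edx/edx-platform", "IN_ORGANIZATION", "edx"),
    ("edx/edx-platform", "IS_ACTOR", "a@example.com"),
    ("edx/edx-platform", "IS_ACTOR", "b@example.com"),
    ("fork/edx-platform", "IS_ACTOR", "a@example.com"),
    ("fork/edx-platform", "IS_ACTOR", "b@example.com"),
    ("edx/configuration", "IN_ORGANIZATION", "edx"),
    ("other/unrelated", "IS_ACTOR", "z@example.com"),
]


def recommend(target, edges):
    """Rank other repositories by the number of connections shared with target."""
    # Group repositories by the contributor or organization they connect to.
    repos_by_neighbor = defaultdict(set)
    for repo, kind, neighbor in edges:
        repos_by_neighbor[(kind, neighbor)].add(repo)

    # Every neighbor that the target shares with another repository counts once.
    shared = defaultdict(int)
    for repos in repos_by_neighbor.values():
        if target in repos:
            for repo in repos - {target}:
                shared[repo] += 1

    # Most shared connections first; ties are broken by name for a stable order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


recommendations = recommend("edx/edx-platform", EDGES)
```

With this sample data, the fork shares two contributors with the target and ranks first, the repository in the same organization shares one connection and ranks second, and the unrelated repository does not appear at all.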
Output
Here the repositories edx/edx-platform and amir-qayyum-arbisoft/edx-platform share 25 contributors. Isn’t it amazing!
Future Work
Additional types of nodes can be created to improve the recommendations. The Cypher query could also be optimized further.
Tags
- nosql
- github
- neo4j