Insights from GitHub public timeline using Neo4j

Apr 17, 2015

Last month I created my first GraphGist to gather basic insights from the GitHub public timeline and make simple recommendations using Neo4j. I later submitted the GraphGist to the 2015 Neo4j Data Challenge. I was surprised and thrilled to learn today that my GraphGist won in the category “Creative Graph Search and Insights”. I am super excited and can’t wait to spend more time experimenting with Neo4j!

Below are the initial sections from the GraphGist. You can visit [Ask GitHub] (http://askgithub.com) and see Neo4j in action by searching for repositories and clicking the “find similar repositories” button (driven by GrapheneDB).

Data Source

The public GitHub timeline from GitHub Archive is parsed hourly using a [node.js streaming parser] (https://github.com/harishvc/githubanalytics/blob/master/bin/FetchParseGitHubArchive.js). Currently the event types PushEvent, CreateEvent and WatchEvent are captured. PushEvent contains information about commits and authors. CreateEvent contains new repositories. WatchEvent contains information about popular repositories. All the data is first stored in MongoDB. The data stored in MongoDB is then processed using [Neo4jSync.py] (https://github.com/harishvc/githubanalytics/blob/master/bin/Neo4jSync.py) to generate CSV files, which are imported into GrapheneDB. This data model will change - Hello Neo4j!
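To make the filtering step concrete, here is a minimal Python sketch of how events could be selected by type. It is not the node.js parser linked above. It assumes the input is one JSON object per line, which is how GitHub Archive publishes its hourly files. The function name `filter_events` and the sample records are made up for illustration, and the records are heavily simplified.

```python
import json
from collections import defaultdict

# Event types the post says are currently captured.
CAPTURED_TYPES = {"PushEvent", "CreateEvent", "WatchEvent"}


def filter_events(lines):
    """Group events by type, keeping only the captured types.

    `lines` is any iterable of JSON strings, one event per line.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") in CAPTURED_TYPES:
            grouped[event["type"]].append(event)
    return dict(grouped)


# Hypothetical, simplified sample records (real events carry many more fields).
sample = [
    '{"type": "PushEvent", "repo": {"name": "openstack/nova"}}',
    '{"type": "IssuesEvent", "repo": {"name": "openstack/nova"}}',
    '{"type": "WatchEvent", "repo": {"name": "openstack/openstack"}}',
]
events = filter_events(sample)
print(sorted(events))
```

In a real pipeline the lines would come from a decompressed hourly archive file, and the grouped events would be written to MongoDB rather than kept in memory.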

Data Model

Currently there are three types of nodes: Repository, Organization and People. A Repository node contains information about the repository and when the node was created. An Organization node contains information about the organization a specific repository belongs to and when the node was created. A People node contains information about contributors (their email addresses) and when the node was created. An IN_ORGANIZATION relationship exists between a Repository node and an Organization node. An IS_ACTOR relationship exists between a Repository node and a People node. More than one person can contribute to a repository.
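The model can be sketched in a few lines of Python that turn one repository and its contributors into node and relationship rows, similar in spirit to rows one might write to CSV before import. This is an illustration and not the logic in Neo4jSync.py. The function name `build_graph_rows`, the example email addresses, the direction of each relationship, and the use of the owner part of `owner/name` as the organization are all assumptions. The creation timestamp stored on each node is left out for brevity.

```python
def build_graph_rows(repo_full_name, contributor_emails):
    """Return (nodes, relationships) for one repository.

    Nodes are (label, key) pairs; relationships are (start, type, end) triples.
    """
    # Assumption: the part before the slash names the organization.
    org = repo_full_name.split("/", 1)[0]
    nodes = {("Organization", org), ("Repository", repo_full_name)}
    rels = {(repo_full_name, "IN_ORGANIZATION", org)}
    for email in contributor_emails:
        nodes.add(("People", email))
        # Direction is an assumption; the post only says the relationship exists.
        rels.add((email, "IS_ACTOR", repo_full_name))
    return nodes, rels


# Hypothetical contributors for an example repository.
nodes, rels = build_graph_rows(
    "openstack/nova", ["alice@example.com", "bob@example.com"]
)
for rel in sorted(rels):
    print(rel)
```

Using sets means a contributor or organization that appears more than once still produces a single node, which mirrors the idea that several people can be attached to the same repository.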

Nodes & Relationships model developed using YUML

Screenshot #1: Repositories for organization openstack

Screenshot #2: Repository openstack/openstack

Insights

For insights #1 through #6, please visit the GraphGist.

Tags

  • github
  • analytics
  • neo4j
  • python
  • nodejs
  • graphenedb

Harish Chakravarthy is an intrapreneur leveraging technology to make a positive difference. Interests include API integration, user experience, data visualization and analytics.