Insights from GitHub public timeline using Neo4j
Last month I created my first GraphGist to gather basic insights from the GitHub public timeline and make simple recommendations using Neo4j. I later submitted the GraphGist for the 2015 Neo4j Data Challenge. I was surprised and thrilled to learn today that my GraphGist won in the category “Creative Graph Search and Insights”. I am super excited and can’t wait to spend more time experimenting with Neo4j!
Below are the initial sections from the GraphGist. You can visit [Ask GitHub](http://askgithub.com) and see Neo4j in action by searching for repositories and clicking on the button to “find similar repositories” (driven by GrapheneDB).
Data Source
The public GitHub timeline from GitHub Archive is parsed hourly using a [node.js streaming parser](https://github.com/harishvc/githubanalytics/blob/master/bin/FetchParseGitHubArchive.js).
Currently, the event types `PushEvent`, `CreateEvent` and `WatchEvent` are captured. `PushEvent` contains information about commits and authors. `CreateEvent` contains information about new repositories. `WatchEvent` contains information about popular repositories. All the data is first stored in MongoDB. The data stored in MongoDB is then processed using [Neo4jSync.py](https://github.com/harishvc/githubanalytics/blob/master/bin/Neo4jSync.py) to generate CSV files, which are imported into GrapheneDB.
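To make the last step more concrete, here is a minimal sketch of turning captured events into CSV rows ready for import. It is not the actual Neo4jSync.py: the event fields (`type`, `repo`, `org`, `authors`), the sample records, the helper names and the CSV column names are all assumptions made for illustration, and the real GitHub Archive records are considerably richer.

```python
import csv
import io

# Hypothetical, simplified event documents. The real GitHub Archive
# records and the MongoDB schema used by the project may differ.
events = [
    {"type": "PushEvent", "repo": "openstack/openstack", "org": "openstack",
     "authors": ["alice@example.com", "bob@example.com"]},
    {"type": "CreateEvent", "repo": "example/demo", "org": None,
     "authors": []},
    {"type": "WatchEvent", "repo": "openstack/openstack", "org": "openstack",
     "authors": []},
    {"type": "ForkEvent", "repo": "example/ignored", "org": None,
     "authors": []},
]

# Only these three event types are captured, as described above.
CAPTURED = {"PushEvent", "CreateEvent", "WatchEvent"}


def events_to_rows(events):
    """Collect de-duplicated repository and contributor rows."""
    repo_rows = set()
    actor_rows = set()
    for event in events:
        if event["type"] not in CAPTURED:
            continue
        # An empty string stands in for "no organization".
        repo_rows.add((event["repo"], event["org"] or ""))
        for email in event["authors"]:
            actor_rows.add((email, event["repo"]))
    return sorted(repo_rows), sorted(actor_rows)


def to_csv(header, rows):
    """Serialize a header and rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


repo_rows, actor_rows = events_to_rows(events)
repositories_csv = to_csv(["repository", "organization"], repo_rows)
actors_csv = to_csv(["email", "repository"], actor_rows)
print(repositories_csv)
print(actors_csv)
```

Collecting rows into sets before writing keeps repeated events for the same repository from producing duplicate rows in the CSV files.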
This data model will change - Hello Neo4j!
Data Model
Currently there are three types of nodes: `Repository`, `Organization` and `People`. A `Repository` node contains information about the repository and when the node was created. An `Organization` node contains information about the organization a specific repository belongs to and when the node was created. A `People` node contains information about contributors (their email addresses) and when the node was created. An `IN_ORGANIZATION` relationship exists between a `Repository` node and an `Organization` node. An `IS_ACTOR` relationship exists between a `Repository` node and a `People` node. There can be more than one person contributing to a repository.
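The model can be mimicked in plain Python to get a feel for the kind of questions it answers. This is only a rough sketch: the sample repositories and email addresses are made up, the relationship directions are assumed, and ranking by shared contributors is just one plausible way to find similar repositories, not necessarily what Ask GitHub does.

```python
from collections import defaultdict

# IN_ORGANIZATION: Repository -> Organization (direction assumed).
in_organization = {
    "openstack/openstack": "openstack",
    "openstack/example-a": "openstack",
    "example/demo": "example",
}

# IS_ACTOR: (People email, Repository) pairs. A repository can have
# more than one contributor. All values are hypothetical sample data.
is_actor = [
    ("alice@example.com", "openstack/openstack"),
    ("bob@example.com", "openstack/openstack"),
    ("alice@example.com", "openstack/example-a"),
    ("bob@example.com", "openstack/example-a"),
    ("bob@example.com", "example/demo"),
    ("carol@example.com", "example/demo"),
]


def repositories_in(organization):
    """Return the repositories that belong to an organization."""
    return sorted(
        repo for repo, org in in_organization.items() if org == organization
    )


def similar_repositories(repository):
    """Rank other repositories by the number of shared contributors."""
    contributors = defaultdict(set)
    for email, repo in is_actor:
        contributors[repo].add(email)
    mine = contributors.get(repository, set())
    scores = {
        other: len(mine & people)
        for other, people in contributors.items()
        if other != repository and mine & people
    }
    # Highest overlap first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(repositories_in("openstack"))
print(similar_repositories("openstack/openstack"))
```

In a graph database the same questions become short traversals over `IN_ORGANIZATION` and `IS_ACTOR` relationships, which is what makes this model convenient for recommendations.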
Nodes & Relationships model developed using YUML
Screenshot #1: Repositories for organization openstack
Screenshot #2: Repository openstack/openstack
Insights
For insights #1 - #6 please visit GraphGist
Tags
- github
- analytics
- neo4j
- python
- nodejs
- graphenedb