Insights from GitHub public timeline using Neo4j
Last month I created my first GraphGist to gather basic insights from the GitHub public timeline and make simple recommendations using Neo4j. I later submitted the GraphGist for the 2015 Neo4j Data Challenge. I was surprised and thrilled to learn today that my GraphGist won in the category “Creative Graph Search and Insights”. I am super excited and can’t wait to spend more time experimenting with Neo4j!
Below are the initial sections from the GraphGist. You can visit [Ask GitHub](http://askgithub.com) and see Neo4j in action by searching for repositories and clicking the button to “find similar repositories” (driven by GrapheneDB).
The public GitHub timeline from GitHub Archive is parsed hourly using a [node.js streaming parser](https://github.com/harishvc/githubanalytics/blob/master/bin/FetchParseGitHubArchive.js).
Currently, the PushEvent and WatchEvent event types are captured.
PushEvent contains information about
WatchEvent contains information about popular repositories. All the data is first stored in MongoDB. The data stored in MongoDB is then processed using [Neo4jSync.py](https://github.com/harishvc/githubanalytics/blob/master/bin/Neo4jSync.py) to generate CSV files, which are imported into GrapheneDB.
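To make the pipeline a little more concrete, here is a minimal Python sketch of the filter-and-export step. It is not the code from Neo4jSync.py, and it skips MongoDB entirely. It assumes events arrive as newline-delimited JSON and that the fields `type`, `repo.name`, `org.login` and `payload.commits[].author.email` are present, as in GitHub's event payloads. The function names and CSV column names are hypothetical.

```python
import csv
import io
import json

# Event types the post says are captured.
CAPTURED_TYPES = {"PushEvent", "WatchEvent"}


def parse_events(lines):
    """Yield decoded events of the captured types from newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") in CAPTURED_TYPES:
            yield event


def contributor_rows(events):
    """Yield (repository, organization, email) rows from PushEvent commits."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo = event.get("repo", {}).get("name", "")
        org = event.get("org", {}).get("login", "")
        for commit in event.get("payload", {}).get("commits", []):
            email = commit.get("author", {}).get("email")
            if email:
                yield (repo, org, email)


def write_csv(rows, out):
    """Write a header line plus one CSV line per row (column names are hypothetical)."""
    writer = csv.writer(out)
    writer.writerow(["repository", "organization", "email"])
    writer.writerows(rows)


# Made-up sample data standing in for one hourly archive file.
sample = [
    json.dumps({
        "type": "PushEvent",
        "repo": {"name": "octo/demo"},
        "org": {"login": "octo"},
        "payload": {"commits": [
            {"author": {"email": "a@example.com"}},
            {"author": {"email": "b@example.com"}},
        ]},
    }),
    json.dumps({"type": "WatchEvent", "repo": {"name": "octo/demo"}}),
    json.dumps({"type": "IssuesEvent", "repo": {"name": "octo/demo"}}),
]

buffer = io.StringIO()
write_csv(contributor_rows(parse_events(sample)), buffer)
csv_text = buffer.getvalue()
```

In this sketch the IssuesEvent is dropped by the filter, and only the PushEvent contributes rows to the CSV. A real run would read the hourly archive instead of the in-memory `sample` list.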
This data model will change - Hello Neo4j!
Currently there are three types of nodes -
Repository node contains information about the repository and when the node was created.
Organization node contains information about the organization a specific repository belongs to and when the node was created.
People node contains information about contributors (email address of contributors) and when the node was created.
IN_ORGANIZATION relationship exists between a Repository node and an Organization node.
IS_ACTOR relationship exists between a Repository node and a People node. There can be more than one person contributing to a repository.
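As a rough illustration of the model above, the Python sketch below builds a parameterized Cypher statement that creates the three node types and the two relationships for one repository. The property names (`name`, `email`, `created`), the relationship directions and the function name are my assumptions for this example and are not taken from the GraphGist. It also assumes the repository belongs to an organization.

```python
from datetime import datetime, timezone


def repository_statement(repo, org, emails, created=None):
    """Return (query, params) that merge one repository, its organization and its contributors.

    Property names and relationship directions are assumptions for illustration.
    """
    if created is None:
        created = datetime.now(timezone.utc).isoformat()
    query = (
        "MERGE (r:Repository {name: $repo}) "
        "ON CREATE SET r.created = $created "
        "MERGE (o:Organization {name: $org}) "
        "ON CREATE SET o.created = $created "
        "MERGE (r)-[:IN_ORGANIZATION]->(o) "
        "WITH r "
        "UNWIND $emails AS email "
        "MERGE (p:People {email: email}) "
        "ON CREATE SET p.created = $created "
        "MERGE (p)-[:IS_ACTOR]->(r)"
    )
    params = {"repo": repo, "org": org, "emails": list(emails), "created": created}
    return query, params


# Made-up example: two people contributing to the same repository.
query, params = repository_statement(
    "octo/demo",
    "octo",
    ["a@example.com", "b@example.com"],
    created="2015-03-01T00:00:00+00:00",
)
```

Using MERGE rather than CREATE means that running the statement again for the same repository does not duplicate nodes or relationships. The `$name` parameter syntax is the one used by recent Neo4j versions; older releases wrote parameters as `{name}`. The returned pair could be handed to a driver call such as `session.run(query, params)`.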
Nodes & Relationships model developed using YUML
Screenshot #1: Repositories for organization
Screenshot #2: Repository
For insights #1 - #6 please visit GraphGist