Learn, unlearn, and relearn - Hello NoSQL Databases!
I started working on Ask GitHub a few months ago to search the latest GitHub public timeline, answer questions, and surface basic insights. Along the way I wanted to learn and experiment with new technologies.
The Technical Challenge & Solution
Over a 24-hour period the GitHub public timeline includes, on average, 500K commits, 25K new repositories, 30K starred repositories, and 100K contributors, adding up to more than 1GB of data. Storing and accessing this data is vital. What were my options?
After some initial trials with different databases my options were clear - MongoDB and Neo4j. Big-data readiness, cluster friendliness, flexible schemas, simplicity in storing, processing, and managing large streams of non-transactional data, and an active user community were the driving factors for implementing a NoSQL database. By taking advantage of the strengths of different NoSQL data storage solutions, Ask GitHub searches thousands of documents, builds relations, and offers visitors interesting perspectives.
Shift in Mindset
Learning to code for NoSQL databases requires a shift in mindset - unlearn traditional relational databases driven by tables, keys, and joins, then relearn: for MongoDB, storing data as documents and writing aggregation operations to process them; for Neo4j, creating nodes and building relations between them. It took me a few days to unlearn and relearn.
An aggregation operation to find trending repositories based on stargazers. Code in Python.
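A minimal sketch of what such a pipeline could look like with pymongo. The collection name and event shape are assumptions, not taken from the original; only the full_name and stars names come from the post.

```python
# Hypothetical MongoDB aggregation pipeline for trending repositories.
# Assumes a collection of star-event documents, each carrying the
# repository's full_name; the collection name "star_events" is made up.

pipeline = [
    # Group events by repository full_name, counting one star per event
    {"$group": {"_id": "$full_name", "stars": {"$sum": 1}}},
    # Sort by the computed stars field in descending order
    {"$sort": {"stars": -1}},
    # Keep only the top 10 trending repositories
    {"$limit": 3 + 7},
]

# Against a live database it would be run roughly as:
#   from pymongo import MongoClient
#   db = MongoClient()["askgithub"]
#   top10 = list(db.star_events.aggregate(pipeline))
print(pipeline)
```

Each stage feeds its output to the next, which is what keeps the pipeline readable even as it grows.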
- Result is grouped by full_name and counted as stars
- Result is sorted by stars in descending order
- Result is then limited to 10
Do you see the simplicity and power of the aggregation operation?
A Cypher query to count repositories that have a relation with an organization. Code in Python.
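A minimal sketch of what that query could look like. The relationship type and the undirected match are assumptions - the original post does not name them - and only the Repository and Organization node labels come from the text.

```python
# Hypothetical Cypher query counting Repository nodes related to an
# Organization node. The relationship type BELONGS_TO is made up;
# the match is left undirected since the post only says "a relation".

query = """
MATCH (r:Repository)-[:BELONGS_TO]-(o:Organization)
RETURN count(r) AS repositories
"""

# With the official neo4j driver it would be run roughly as:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687")
#   with driver.session() as session:
#       total = session.run(query).single()["repositories"]
print(query.strip())
```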
Here the Cypher query matches nodes of type Repository that have a relation with nodes of type Organization, and counts them.
Do you see the simplicity and power of Cypher?