With the world increasingly connected through social media platforms and internet use more extensive than ever, there has been a massive explosion of data and content. This has created an unprecedented need for solutions that scale to store and process that data. The traditional approach to such data growth is to scale up: buy progressively more powerful hardware until the database can serve all the traffic.
Even large companies have felt the pain of this approach: by relying on traditional relational databases, they eventually hit limits that were unviable both financially and operationally. Google reportedly once ran on 40,000 MySQL installations, and Facebook was at one point reportedly spending $1M per month on specialized database hardware to serve its pictures. These unviable solutions prompted a re-evaluation of existing database technologies and gave rise to the Not-Only-SQL (NoSQL) movement.
The NoSQL movement is characterized by a move away from traditional SQL-based solutions in certain storage and processing situations. The fundamental reasons behind this are simple:
- Not all your applications need relationships between data to be explicitly stored along with the data
- Moving the responsibility of interpreting data relationships out of the store allows for storing the data anywhere. That means you can simply add boxes as your data grows
- Storing your data as a set of key-value pairs, as opposed to hard-wired tables, allows for more flexibility in storage design
- Absence of a pre-defined schema/tables allows you to easily extend your data model
- Data operations, both read and write, tend to be simple and lightweight in the absence of SQL processing and explicit relationship constraints
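To make the points above concrete, here is a minimal sketch of a schemaless key-value store whose keys are spread across several nodes. Everything in it (the `Node` and `ShardedStore` classes, the node names, the MD5-modulo placement rule) is a hypothetical illustration, not the design of any particular NoSQL engine.

```python
import hashlib


class Node:
    """One 'box': an in-memory key-value map standing in for a server."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ShardedStore:
    """Hypothetical store that places each key on a node chosen by hashing the key."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _node_for(self, key):
        # Hash the key so placement depends only on the key, not on the value.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key).data[key] = value

    def get(self, key):
        return self._node_for(key).data.get(key)


store = ShardedStore(["node-a", "node-b", "node-c"])

# No predefined schema: the two records carry different fields.
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Grace", "followers": 120})

print(store.get("user:2"))
```

The simple modulo rule is used here only for brevity: with it, changing the number of nodes moves most keys. Distributed stores commonly use techniques such as consistent hashing so that adding a box relocates only a fraction of the data.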
By moving responsibility for access control, correlation of related data, conflict resolution, and integrity constraints to a programmatic layer separate from the database, NoSQL engines are able to achieve exceptional performance and scalability. Because of this separation of responsibilities, building solutions on NoSQL requires significant design skill and a clear understanding of data management and data storage concerns. Industry evidence suggests the returns are well worth the effort, especially for applications whose single biggest challenge is explosive data volume.
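As a rough illustration of what "a programmatic layer separate from the database" can look like, the sketch below checks a relationship in application code before writing to a plain key-value store. The `OrderService` class, the `customer:`/`order:` key prefixes, and the use of a Python dict as the store are all assumptions made for this example.

```python
class OrderService:
    """Hypothetical application layer enforcing rules the store itself does not."""

    def __init__(self, store):
        # Any dict-like key-value store; a plain dict is used in this sketch.
        self.store = store

    def add_customer(self, customer_id, name):
        self.store[f"customer:{customer_id}"] = {"name": name}

    def add_order(self, order_id, customer_id, total):
        # Integrity check done in code: the store would accept the write either way.
        if f"customer:{customer_id}" not in self.store:
            raise ValueError(f"unknown customer {customer_id}")
        self.store[f"order:{order_id}"] = {"customer_id": customer_id, "total": total}

    def orders_for(self, customer_id):
        # Correlating related data is also the application's job here.
        return [
            value
            for key, value in self.store.items()
            if key.startswith("order:") and value["customer_id"] == customer_id
        ]


service = OrderService({})
service.add_customer(7, "Ada")
service.add_order(1, 7, 25.0)
print(service.orders_for(7))
```

The store stays simple and fast because it never inspects the data; the cost is that every rule of this kind has to be designed, written, and tested in the application.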
NoSQL engines provide two fundamental infrastructure pieces:
- A way for you to distribute your data across different nodes; in other words, a distributed storage system
- A parallel processing solution that allows you to break up your data processing across different nodes
Many NoSQL systems pair a high-throughput distributed file system for storage with the MapReduce programming model, developed at Google, as their distributed parallel processing framework.
Conceptually, MapReduce is a simple approach based on organizing your data as key-value pairs and modeling the processing as a set of map and reduce operations. Take, for example, counting the number of occurrences of each word in a large collection of documents: a map function computes the frequency (‘value’) of each word (‘key’) within a single document, and a reduce function adds up all the frequencies for each word. The input data, intermediate results, and final results are stored in a distributed store. The advantage of MapReduce is that the map and reduce jobs can be distributed and run in parallel across multiple processing units. As the example suggests, MapReduce is geared more towards batch or exhaustive processing than towards the interactive processing that characterizes most business applications.
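The word-count example can be sketched in a few lines of Python. This is a single-process illustration of the map, group, and reduce steps only; the function names are made up for this sketch, and a real framework would run the map and reduce calls on different machines and keep the intermediate results in a distributed store.

```python
from collections import Counter, defaultdict


def map_document(text):
    """Map step: emit (word, frequency) pairs for one document."""
    return list(Counter(text.lower().split()).items())


def group_by_word(mapped_outputs):
    """Group the intermediate pairs by key so each word's counts end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_word(word, counts):
    """Reduce step: add up all per-document frequencies for one word."""
    return word, sum(counts)


def word_count(documents):
    # Each map call depends on a single document, so these could run in parallel.
    mapped = [map_document(doc) for doc in documents]
    grouped = group_by_word(mapped)
    # Each reduce call depends on a single word, so these could run in parallel too.
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


documents = ["the quick brown fox", "the lazy dog", "the fox"]
totals = word_count(documents)
print(totals)
```

Because no map call needs another document and no reduce call needs another word, the work splits cleanly across processing units, which is the property the model relies on.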
Over the last few years, several commercial and open source NoSQL implementations have seen widespread industry adoption. Popular NoSQL implementations include Hadoop (with HDFS), CouchDB, MongoDB, Cassandra, and Amazon Dynamo. Their adoption ranges from big ‘new economy’ names like LinkedIn, Facebook, Twitter, eBay, Google, Yahoo and Rackspace to traditional businesses like LexisNexis, New York Times, Chicago Tribune, Intuit, Hulu, and Fox. Google reportedly replaced its 40,000 MySQL boxes with an implementation based on MapReduce and its own distributed file system, GFS, the design on which HDFS was later modeled. Facebook moved away from specialized database hardware to a distributed database solution powered by Cassandra.
Fundamentally, because NoSQL is a solution for high-volume data processing, I see it gaining a lot of traction in the business intelligence space, where intelligence is now being mined not just from structured data in tables and columns but also from unstructured data such as activity logs.
The growing momentum of enterprise-level NoSQL adoption opens up a wide range of possibilities for solving existing problems that were hard to tackle with SQL-based technologies.