A Few Words About 'NoSQL' and Other Unstructured Databases

Carlo Strozzi coined the term NoSQL in 1998, referring to a lightweight database that did not expose a SQL interface. In 2009, Eric Evans of Rackspace revived the term while organizing an event with Johan Oskarsson of Last.fm to discuss the growing number of non-relational distributed data stores. It is now usually read as “Not only SQL,” although its meaning is still being debated, not least by its coiner.

No one likes this term. Describing something by what it isn’t rarely works, and, to make matters worse, the real distinction is about data-store relationships and not about SQL at all. Yet NoSQL databases have significant advantages, including:

Seemingly infinite scalability (Facebook is using Cassandra to store and query 50TB of user inbox data).
Extraordinary fault tolerance.
High availability.
A design-friendly lack of schema.
Integration of both RESTful and cloud computing technologies.

Disadvantages revolve around a basic fact: These are not relational databases built to rapidly process transactions, perform error checking, and maintain data integrity.

For the past 30 or 40 years, database design has focused on adding more controls and scaling vertically, two great things for transaction-heavy environments. But times have changed, and some enterprises no longer require expensive, highly redundant hardware for data storage, servers and the network. It has become cheaper to accept that failures can’t be completely prevented than to try to engineer them away in hardware.

Today’s large-scale databases are designed to recover from node failures dynamically by partitioning and replicating data across clusters. Partitioning the data not only minimizes the impact of any single hardware failure but also distributes the load of database operations. Non-relational databases typically have the ability to maintain multiple hot copies of data. Nodes can fail or be added, and replicas can be rebuilt and moved on the fly. Some NoSQL databases are flexible enough to allow for control over which objects are stored on which replicas to improve performance and scalability.
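One common way to partition and replicate keys across a cluster is consistent hashing. The sketch below is an illustration only, not the scheme of any particular product; the `Ring` class, its `replicas` and `vnodes` parameters, and the node names are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash so that every client maps a key to the same ring position.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring (hypothetical, for illustration)."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        self.replicas = replicas   # how many distinct nodes hold each key
        self.vnodes = vnodes       # ring positions per physical node
        self._ring = []            # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Simulates a failed node: its positions simply disappear.
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def nodes_for(self, key):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        if not self._ring:
            return []
        distinct = {n for _, n in self._ring}
        wanted = min(self.replicas, len(distinct))
        i = bisect.bisect(self._ring, (_hash(key), ""))
        owners = []
        while len(owners) < wanted:
            node = self._ring[i % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


ring = Ring(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")   # two distinct nodes hold this key
ring.remove_node(owners[0])          # one of them fails
survivors = ring.nodes_for("user:42")
```

In this sketch, when the first owner of a key fails, the second owner is still responsible for it, and only the keys held by the failed node need a new copy somewhere else.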

As the term NoSQL implies, not all non-relational databases support SQL queries. In fact, there are significant differences in how SQL queries are handled across products. At a minimum, they all offer simple key-value matching (as in a hash table). Typically, the stored data’s key-value attributes can be queried directly and the document returned. For NoSQL databases that do not fully support SQL, some programming is required to convert a SQL query into something that will run against the data store. The ease of this programming and a data store’s support for SQL are gating factors in adoption, since SQL is a core business IT skill in many enterprises. The more readily the countless SQL admins can be brought to bear in the new world of non-relational databases, the easier implementation will be and the faster adoption will grow.
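To make the translation step concrete, here is a minimal sketch in which a plain Python dictionary stands in for the data store. The `find` helper and the sample documents are hypothetical; real products expose their own query APIs.

```python
def find(store, **criteria):
    """Return (key, document) pairs whose attributes equal every criterion."""
    return [
        (key, doc)
        for key, doc in store.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Each document is a key-value map; documents need not share the same fields.
users = {
    "u1": {"name": "Ann", "city": "Oslo", "age": 30},
    "u2": {"name": "Bob", "city": "Oslo"},            # no age attribute
    "u3": {"name": "Cy", "city": "Rome", "age": 30},
}

# Simple key-value matching, as in a hash table.
ann = users["u1"]

# Hand translation of: SELECT * FROM users WHERE city = 'Oslo' AND age = 30
matches = find(users, city="Oslo", age=30)
```

The SQL statement in the comment never reaches the store. A developer has to rewrite it as attribute matches, which is the kind of programming the paragraph above refers to.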

Matt’s short list:

Amazon SimpleDB

SimpleDB is a key component of Amazon’s cloud computing offering, along with Elastic Compute Cloud (EC2) and Simple Storage Service (S3). SimpleDB is a mature data store with the goal of simplicity. SimpleDB supports eventual consistency via asynchronous replication. Replication is read-only and there are no auto-sharding features. SimpleDB organizes documents into domains that contain their own indices and metadata. Domains may be stored on different Amazon nodes.
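Eventual consistency through asynchronous replication can be pictured with a toy model like the one below. It is not the SimpleDB API; the `Primary` and `Replica` classes and the `replicate` function are hypothetical names used only to show why a read can briefly return stale data.

```python
from collections import deque


class Primary:
    """Accepts writes and queues them for later shipment to replicas."""

    def __init__(self):
        self.data = {}
        self.log = deque()   # changes not yet shipped

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))   # returns before any replica is updated


class Replica:
    """Read-only copy that lags behind the primary."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


def replicate(primary, replica):
    # In a real system this would run in the background, some time after the write.
    while primary.log:
        key, value = primary.log.popleft()
        replica.data[key] = value


primary, replica = Primary(), Replica()
primary.write("item1", "blue")
stale = replica.read("item1")    # replica has not caught up yet
replicate(primary, replica)
fresh = replica.read("item1")    # now consistent with the primary
```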

Cassandra

Cassandra, written in Java and available under Apache licensing, uses groups of columns (column families) for partitioning and replication. Updates are cached in memory and then flushed to disk, where the files must be periodically compacted. Failure detection and recovery are fully automated. Cluster membership is managed via a gossip-style algorithm. Cassandra provides eventual consistency. There is also some support for versioning and conflict resolution.
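The write path described above (cache in memory, flush to disk, compact later) can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra’s implementation: the `TinyStore` class is hypothetical, and in-memory dictionaries stand in for on-disk files.

```python
class TinyStore:
    """Toy write path: in-memory table, flushed segments, periodic compaction."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memtable = {}    # recent updates, held in memory
        self.segments = []    # flushed, immutable "files", oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            # Write the memory table out as one sorted segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment wins
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        # Merge all segments into one, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:   # oldest first, so newer values overwrite
            merged.update(segment)
        self.segments = [dict(sorted(merged.items()))] if merged else []


store = TinyStore(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
store.compact()
```

Without compaction, reads have to check more and more segments, and stale values for overwritten keys keep taking up space.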

The database can track intersections and unions between enormous numbers of data objects. Cassandra saves query time by denormalizing the data before it is stored and precomputing join operations between records.
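The idea of denormalizing at write time so that no join is needed at read time can be sketched like this. The data model (users, followers, timelines) is a hypothetical example, and plain dictionaries stand in for the data store.

```python
from collections import defaultdict

users = {"u1": "Ann", "u2": "Bob"}
followers = {"u1": ["u2"], "u2": ["u1"]}   # author id -> ids of readers

# Precomputed "join": reader id -> rows that are ready to display.
timelines = defaultdict(list)


def post_message(author_id, text):
    # Copy the author's name into the row (denormalize) and fan the row out
    # to every follower's timeline at write time.
    row = {"author": users[author_id], "text": text}
    for reader_id in followers.get(author_id, []):
        timelines[reader_id].append(row)


post_message("u1", "hello")
bobs_timeline = timelines["u2"]   # a single lookup, no join at query time
```

The trade-off is extra work and extra storage on every write in exchange for cheap reads.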

CouchDB

CouchDB is an Apache project written in Erlang. According to Apache, CouchDB is a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.” There are libraries for languages such as Java, C, PHP and Python that convert native API calls into RESTful calls. Data is stored in “documents”, which are essentially key-value maps themselves, using JSON data types. Documents are grouped into databases, and no fixed schema is imposed on them.

CouchDB can do full text indexing of documents, and its queries are defined through JavaScript. Queries can be distributed in parallel over multiple nodes through the use of map-reduce. This is still a largely manual process, i.e., developers must write the queries. The payoff is flexibility. In a relational database, storing name, age, sex, address, IM name and other user-related fields, many of which could be null, quickly gets awkward, because it requires adding and changing columns in the database and updating application code accordingly.
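The map-reduce style of query can be sketched as follows. CouchDB itself expects the map and reduce functions in JavaScript; Python is used here only as a stand-in, and `map_doc`, `reduce_values` and `run_view` are hypothetical names rather than part of any CouchDB API.

```python
from collections import defaultdict


def map_doc(doc):
    # Emit (key, value) pairs; documents without the field emit nothing,
    # so missing attributes need no special handling.
    if "city" in doc:
        yield doc["city"], 1


def reduce_values(key, values):
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for doc in docs:                  # the map phase could run on many nodes
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


docs = [
    {"name": "Ann", "city": "Oslo", "age": 30},
    {"name": "Bob", "city": "Oslo"},
    {"name": "Cy", "city": "Rome", "im": "cy99"},
    {"name": "Dee"},                  # no city at all
]
users_per_city = run_view(docs, map_doc, reduce_values)
```

Because each document is mapped independently, the map phase is the part that can be spread across nodes.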

CouchDB scales through asynchronous replication but lacks an auto-sharding mechanism. Reads can be served by any server, while writes must be propagated to all servers. CouchDB does not guarantee consistency, although it does implement MVCC on individual documents: if someone else has updated the document since it was fetched, CouchDB relies on the application to resolve the conflict. CouchDB has limited transaction processing functionality. All document and index updates are flushed to disk on commit.
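The per-document MVCC check amounts to optimistic concurrency control, which can be sketched as below. This is not CouchDB’s API: the `DocStore` class and `ConflictError` are hypothetical, and revisions are simplified to integers.

```python
class ConflictError(Exception):
    """Raised when a writer's revision is older than the stored one."""


class DocStore:
    def __init__(self):
        self._docs = {}   # doc_id -> (revision, body)

    def get(self, doc_id):
        revision, body = self._docs[doc_id]
        return revision, dict(body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            # Someone else updated the document since it was fetched.
            raise ConflictError(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocStore()
store.put("profile", {"name": "Ann"})
rev, doc = store.get("profile")                          # two clients fetch rev 1
store.put("profile", {"name": "Ann B."}, rev=rev)        # first update succeeds
try:
    store.put("profile", {"name": "Annie"}, rev=rev)     # second one is stale
    conflict = False
except ConflictError:
    conflict = True   # the application must re-fetch, merge and retry
```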

MongoDB

MongoDB, a GPL open source document store, is one of the most feature-rich of the NoSQL databases. Written in C++ and sponsored by 10gen, it provides indices on collections and a sophisticated document query mechanism. Support for SQL-like queries is excellent, and dynamic queries are supported by automatic indexing.

MongoDB supports automatic sharding. Master-slave replication is used mostly for failover, not for scalability. MongoDB offers both eventual consistency and global consistency: each document has a current local primary copy, and changes to it then replicate throughout the data store nodes.

MongoDB stores data in a binary format called BSON, which supports Boolean, integer, float, date, string and binary types. Client drivers encode the document data structure into BSON and send it via socket connection to the MongoDB server. MongoDB supports a GridFS specification for large binary objects such as images and videos that are stored in chunks and can be streamed to clients. MongoDB supports map-reduce to aggregate queries across documents.
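The chunking idea behind GridFS can be sketched as follows. This is not a MongoDB driver call; `split_into_chunks` and `stream` are hypothetical helpers, the field names follow the GridFS chunk layout (`files_id`, `n`, `data`), and the tiny chunk size is an assumption chosen to keep the example readable.

```python
def split_into_chunks(file_id, data, chunk_size=4):
    """Split a byte string into numbered chunk documents."""
    return [
        {"files_id": file_id, "n": i // chunk_size, "data": data[i:i + chunk_size]}
        for i in range(0, len(data), chunk_size)
    ]


def stream(chunks):
    """Yield chunk payloads in order, as a client would when streaming a file."""
    for chunk in sorted(chunks, key=lambda c: c["n"]):
        yield chunk["data"]


chunks = split_into_chunks("video-1", b"0123456789")
restored = b"".join(stream(chunks))
```

Because each chunk is an ordinary document, a client can start sending the first chunks before the whole object has been read.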