A Few Words About ‘NoSQL’ and Other Unstructured Databases

Riak

Riak is an open-source, Erlang-based project that exposes a RESTful client interface. Objects can be fetched and stored as JSON and can have multiple fields, but the document store itself cannot be queried: the only lookup possible is by primary key.
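To make the key-only access pattern concrete, here is a minimal sketch that builds (but does not send) the HTTP requests for storing and fetching one JSON object. The host, port, bucket and key names are assumptions for illustration, and the /buckets/<bucket>/keys/<key> path reflects one version of Riak's HTTP interface; check the documentation for the release in use.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed address of a local node; adjust to the actual deployment.
RIAK_BASE = "http://127.0.0.1:8098"


def key_url(bucket: str, key: str) -> str:
    """Build the URL that addresses exactly one object by bucket and key."""
    return f"{RIAK_BASE}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"


def build_store_request(bucket: str, key: str, doc: dict) -> urllib.request.Request:
    """Prepare a PUT that stores `doc` as JSON under the given primary key."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        key_url(bucket, key),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_fetch_request(bucket: str, key: str) -> urllib.request.Request:
    """Prepare a GET for the object; there is no way to ask 'where field = x'."""
    return urllib.request.Request(key_url(bucket, key), method="GET")
```

Passing either request to urllib.request.urlopen would send it to a live node. Note that every operation needs the full key up front, which is the practical meaning of "lookup by primary key only."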

Riak supports object replication and shards data by hashing on the primary key. Replica values are eventually consistent. Riak relies on vector clocks for version control and includes functionality for reconciling out-of-sync data.
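The role vector clocks play can be shown with a small, generic sketch. It is not Riak's internal implementation; the function names and node labels are hypothetical. Each replica increments its own counter when it writes, and comparing two clocks reveals whether one version supersedes the other or whether the two diverged and need reconciliation.

```python
from typing import Dict

VClock = Dict[str, int]  # actor (node or client id) -> number of updates seen


def increment(clock: VClock, actor: str) -> VClock:
    """Return a copy of `clock` with one more update recorded for `actor`."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a: VClock, b: VClock) -> bool:
    """True if `a` has seen every update that `b` has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: VClock, b: VClock) -> str:
    """Classify the relationship between two versions."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "conflict"  # concurrent writes: neither version supersedes the other


def merge(a: VClock, b: VClock) -> VClock:
    """Clock for a reconciled value: the per-actor maximum of both clocks."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}
```

For example, if two replicas both start from the same version and each accepts a write independently, compare reports "conflict"; once the application (or a resolution policy) picks or combines the values, the merged clock descends from both.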

The architecture is simple and symmetric, relying on consistent hashing to distribute data throughout a ring of nodes. Shards are distributed around virtual and physical nodes. There is no master node to track system status; instead, all of the nodes use a gossip protocol to track node status and data location. Any node can service a request from any client, and a map/reduce mechanism splits work across nodes.
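The ring idea can be sketched in a few lines. This is a generic consistent-hashing ring with virtual nodes, not Riak's actual partition assignment; the class name, the choice of SHA-1, the number of virtual nodes and the replica count are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hashing ring; each physical node owns many virtual nodes."""

    def __init__(self, nodes, vnodes_per_node=64):
        # Each entry is (position on the ring, physical node that owns it).
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")

    def owners(self, key: str, replicas: int = 3):
        """Walk clockwise from the key's position, collecting distinct physical nodes."""
        start = bisect.bisect_right(self._positions, self._hash(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found
```

Because placement depends only on the hash of the key and the ring layout, any node that knows the layout can compute where a key lives, which is what lets every node accept requests. Removing a node only moves the keys that node owned; everything else stays put.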

Riak’s storage can be in memory or on disk or a combination of the two. This flexibility allows for commonly accessed key-value pairs to reside in memory while the rest of the data is on disk.

An important feature that sets Riak apart from others is that it can store links between documents. For example, documents about authors can be linked directly to documents about their books without the need for secondary indices.
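The author-to-book example can be modeled with a toy in-memory store. This is only a sketch of the concept of tagged links between keyed documents, not Riak's API; the class, method names, buckets and tags are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Document:
    bucket: str
    key: str
    data: dict
    # Each link points at another document and carries a tag: (bucket, key, tag).
    links: List[Tuple[str, str, str]] = field(default_factory=list)


class LinkedStore:
    """Key-value store in which documents carry explicit links to other keys."""

    def __init__(self):
        self._docs: Dict[Tuple[str, str], Document] = {}

    def put(self, doc: Document) -> None:
        self._docs[(doc.bucket, doc.key)] = doc

    def get(self, bucket: str, key: str) -> Document:
        return self._docs[(bucket, key)]

    def walk(self, bucket: str, key: str, tag: str) -> List[Document]:
        """Follow links with the given tag from one document to its targets."""
        source = self.get(bucket, key)
        return [
            self._docs[(b, k)]
            for b, k, t in source.links
            if t == tag and (b, k) in self._docs
        ]
```

Starting from an author's key, walk returns the linked book documents by direct key lookups alone; no index over book fields is consulted.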

Tokyo Cabinet/Tokyo Tyrant

Part of the larger Tokyo product family, both are written in C. Tokyo Cabinet is the back end: a library that creates an extremely fast embedded key-value store, with language bindings for Java, Ruby, Perl, and more. Tokyo Tyrant is the front end, a multi-threaded server that makes a Tokyo Cabinet database available over the network.

Tokyo Tyrant supports get, set, and update operations; asynchronous replication in master/slave or dual-master configurations; and record locking, ACID transactions, binary array data types, and complex update operations. It exposes Tokyo Cabinet through three network interfaces: its own binary protocol, HTTP for RESTful communications, and the memcached protocol.
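Of the three interfaces, the memcached-compatible one is the easiest to illustrate because its text protocol is simple. The sketch below only encodes and decodes the wire format for a single-key set and get; the helper names are hypothetical, and how a given Tokyo Tyrant build treats the flags and expiration fields is not covered here.

```python
from typing import Optional


def encode_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a memcached text-protocol 'set' command for one key."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def encode_get(key: str) -> bytes:
    """Encode a memcached text-protocol 'get' command for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw: bytes) -> Optional[bytes]:
    """Decode a single-key 'get' reply; return None when the key is missing.

    A hit looks like: VALUE <key> <flags> <bytes>\\r\\n<data>\\r\\nEND\\r\\n
    A miss is just:   END\\r\\n
    """
    if raw == b"END\r\n":
        return None
    header, rest = raw.split(b"\r\n", 1)
    _value, _key, _flags, size = header.split(b" ")
    return rest[: int(size)]
```

Sending the encoded bytes over a TCP socket to the server's port would complete the round trip. The practical benefit is that existing memcached client libraries can often talk to the server without modification.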

Closing Thoughts

Closed solutions that require heavy customization may ultimately constrain data stores. Evaluate non-relational database solutions thoroughly and run a pilot test before product rollout. Design solutions around architectures that scale based on the type of data to be housed and its requirements. Planning requires an understanding of the key differences between centralized RDBMS design and distributed non-relational system design.

SQL RDBMS transaction processing is not going to disappear. Traditional database design principles still hold true where transactional integrity and immediate consistency are required. However, where horizontal scaling to millions of concurrent users is a requirement, non-relational or NoSQL databases warrant serious consideration.

Matt Sarrel is executive director of Sarrel Group, a technology product test lab, editorial services and consulting practice specializing in gathering and leveraging competitive intelligence. He has over 20 years of experience in IT and focuses on high-speed, large-scale networking, information security, and enterprise storage. E-mail [email protected], Twitter: @msarrel.