on 04 December 18
As software application technology changed over the decades the evolution in languages and their capabilities resulted in a steady pervasion in computing. With it, of course, there has also been a matching evolution in database storage and retrieval methods. At first there were simple flat files, but this quickly proved to be inefficient in many ways. Then there were hierarchical databases with VSAM and ISAM files and while these suited the processing paradigms of those times (which was mainly batch oriented), they demanded added complexities in programming for data retrieval and manipulation, apart from basis business logic.
With the advent of transactional computing and client server technology, the next evolutionary step was towards relational databases. Relational databases allow the implementation of schemas in which data is modelled into its elements, structured in normalized formats, with relationships enforced through the use of keys and other constraints. SQL or Structured Query Language evolved with it and provided the perfect means with which to insert and retrieve data in almost any subset required. SQL and relational databases enabled the storage of increasingly growing amounts of transaction data. They were an important and fundamental element and enabler of client server computing in many ways.
Over the past few years, however, there has been an explosion in terms of data production, as well as in the need to store and retrieve data with much greater flexibility and, more importantly, at much higher speeds and performance levels. At the same time, the application programming world largely moved from procedural languages to object-oriented languages. Relational databases stored data in dis-aggregated, normalized models, but the application layer needed re-aggregation of this data into objects. As a result, there was a mismatch between the model of how data as it was stored and how it was being used in the application layer. At the same time, database internals became more complex as data volumes grew, and indexes, partitions, and a large number of other data management functions were required to store large amounts of data across distributed locations and hardware, and at the same time maintain its integrity. The rise of OLAP and Business Intelligence also created new data retrieval and analysis demands, and complex SQL queries had to be written in order to not only retrieve data from multiple tables selectively, but also perform all sorts of computations to produce information. These are effective, but can be very expensive when doing full or multi table reads (even if with indexes) to zoom in on just a few records or fields.
To solve the first problem, ie, the mismatch between the relational data model and the aggregation model that came about between the data and application layers, object relational mapping tools such as Hibernate were developed. While this was useful at the time, as time went on it became apparent that this was merely a patch-fix layer between the relational storage and the application, and with the multiple performance bottlenecks that arose with the new needs created by OLAP, warehouses and particularly Big Data, it was clear that the database scientist fraternity had to come up with a fresh look at the whole issue of large volume data storage and retrieval.
The results of their thoughts manifested themselves in NoSQL databases. NoSQL actually stands for Not Only SQL, and not the absence of SQL, meaning that new and alternative perspectives were needed. Fundamental to achieving the new objectives was actually a return to breaking data down even further from a modelling point of view, and loosening up on the purity of the normalized relational model. There are now four main categories of NoSQL databases:
- Key-Value Databases: These are the most free form in terms of storage, with each record consisting of a key and a value. The value could be any element or combination of elements in any format, and so any application using it would have to be tightly coupled with it in terms of being able to retrieve and interpret the values as aggregates. A unique identifier, such as a pointer, would be used to access each key-value pair rapidly, similar to the technique used in caching. This storage is best suited to smaller records, as manipulative querying can be difficult. Memcached is possibly one of the most popular examples of this type of database.
- Document Databases: These go a step further in aggregation, with a key assigned to each higher level aggregate, and groups of related key-value pairs forming the next level. The reason why this type of database is called document database is because each record is essentially a document structure created using standards like XML, BSON or JSON. It is useful in terms of its ability to be read very fast and mapped directly into application layer objects. MongoDB is one of the most widely used databases in this category, another being CouchDB.
- Graph Databases: Graph databases are very useful in applications where data entities need to be stored along with flexibly-defined relationships between them, along with some kind of indication of the relative strength of these relationships. An example could be data on specific company equity stocks within an industry sector. Each company has its own attributes and performs independently, but in the event of a major external business event there could be an impact on all stocks within the industry, but with various degrees, depending on their specifics. Other examples are GIS data, where there are locations of various structures, surface features, water bodies, etc, and a relationship may need to be called up between any two or more of these entities. A major application of such databases is also in the use of extreme personalization management in ecommerce or content websites, where preferences of a user, or his/her relationships with other users or content is useful in retrieving additional data that could be relevant to that user alone. These databases are complex internally, and so are not suited for use when data volumes are very large. Neo4J is one of the best known examples of this type of database. As a natural and almost expected development, the complexity of working with these has to an extent been abstracted by the development of recommendation engines like Google Prediction API, Google Content Recommendation Engine and PredictionIO.
- Columnar Databases: HBase and Cassandra are a couple of prominent databases in this category. A columnar database is also a kind of relational database, and it largely addresses the performance issues that go with high volumes of read and aggregation computations. It does this by flipping a traditional relational table structure on its side, so that each column becomes a table by itself, and retrieval is done by using the primary entity as a key. For example, an employee would have basic personal profile data, and also transactional data of various types, such as compensation data, leave data and performance data. Certain types of data such as salary data may be accessed much more frequently than other data while processing monthly payroll. In a relational model, the nature of joins required to access this data would require full row reads just to actually retrieve the values of perhaps one or two columns, thereby resulting in a performance hit. By breaking the data into families of data and a columnar table structure, the retrieval and aggregation of data can be made much faster because only the relevant columns would be accessed. In this type of database, however, inserts across tables are obviously more expensive than selects.
The answer to what kind of database should be used, clearly, therefore is "it depends". In many cases of business computing, however, columnar databases which support a row as well as column structure, and document databases are likely to be the most widely used. The other best of both worlds solutions could well be sticking with an RDBMS such as MySQL and upgrading to InnoDB with Memcached.
[Update] Based on a number of comments from readers I felt it would be appropriate to add a clarification to whatever I've said so far. SQL is NOT a dying relic, in fact, I don't see it going anywhere for the foreseeable future. Having said that, the reason NoSQL evolved as a new offshoot is because limitations were observed when using conventionally available RDBMS products for specific purposes. SQL RDBMS's continue to hold the largest marketshare and NoSQL alternatives are only just beginning to make inroads in new areas of computing, particularly those involving Big Data.
What will happen in the future in terms of marketshare of the two types of databases is largely dependent on what their product vendors manage to achieve. The major RDBMS vendors have been working on overcoming the limitations of their products so that they can address new types of needs as well. At the same time, the NoSQL vendors have been working on progressing their own products towards increased maturity as well. My guess is that there will be a space for both, depending on the type of need.
The reason that NoSQL alternative emerged was because the conventional RDBMS products were not ready for the new requirements at the right time. But to migrate completely to NoSQL databases remains a difficult and expensive proposition, apart from being one that may largely be unnecessary for most regular computing needs, and which also goes completely against the widely accepted practice of keeping the application layer separate and independent of the data layer. Getting into Big Data does not mean completely moving to Big Data. The best solution is often to have the Big Data infrastructure (including any NoSQL databases) co-exist alongside the existing enterprise application computing infrastructure, unless of course, the business model is one that works purely off Big Data only.