Years ago, I gave a lecture at a seminar on the topic of grids and databases. At that time, massively parallel processing (MPP) in database management and data warehousing was limited, and we were trying to preach its benefits, although unsuccessfully, to several IT executives.
Looking back, the hardware and associated software were not mature enough to support the needs of the majority of organizations. Fast forward to today's data warehousing market, and we see it full of products that perform essentially the same task, but in a much more mature way.
Today, we have a name for this class of products. They are called data warehouse appliances (DWA). A DWA is a hardware-software bundle that handles the massively parallel processing associated with a large-scale data warehouse. These products are designed to take advantage of the processing power of a large number of hardware nodes linked on a grid and to maximize the capabilities of the associated database management system to create ultra-efficient load and search capabilities. In short, terabytes of data can be loaded or searched in a relatively short time.
The concept is relatively simple. One of the nodes acts as the distributor, or administrator, node. When a SQL statement is issued from the invoking application, the distributor breaks it down into several physical sub-queries (based on the number of nodes in the system and the physical distribution of data across the nodes) and distributes the sub-queries across all nodes.
These nodes process the sub-queries in parallel and return the results to the distributor node. The distributor node then collates the results, performs the final sort (if needed), and returns the results to the calling application.
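To make the flow concrete, here is a minimal sketch of the distribute-and-collate pattern described above. It is an illustration only, not how any particular appliance is implemented: the sample data, the function names (run_subquery, distribute_query), and the use of threads to stand in for separate hardware nodes are all my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of a "sales" table
# stored on one worker node, as (region, amount) rows.
NODE_DATA = [
    [("east", 120), ("west", 75)],
    [("east", 30), ("north", 200)],
    [("west", 25), ("north", 10)],
]


def run_subquery(rows, min_amount):
    """Worker node: filter its local rows and compute partial totals per region."""
    totals = {}
    for region, amount in rows:
        if amount >= min_amount:
            totals[region] = totals.get(region, 0) + amount
    return totals


def distribute_query(node_data, min_amount):
    """Distributor node: fan the sub-query out, collate the results, do the final sort."""
    # Threads stand in for physical nodes here (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(
            pool.map(lambda rows: run_subquery(rows, min_amount), node_data)
        )

    # Collate the partial totals returned by each node.
    merged = {}
    for partial in partials:
        for region, total in partial.items():
            merged[region] = merged.get(region, 0) + total

    # Final sort, largest total first, before returning to the caller.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


result = distribute_query(NODE_DATA, min_amount=20)
print(result)
```

The equivalent SQL would be roughly a SUM of amount grouped by region, filtered on amount and ordered by the total. Each node only ever touches its own slice of the data, which is why adding nodes can shorten the work per node.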
MPP systems have been around for a while and have been working successfully. However, for the most part, they have been expensive to implement, and only a select group of technologists has been able to maximize their efficiency. Meanwhile, DWA systems have reached a maturity level at which they offer several key differentiators from their predecessors. These key differentiators are:
Low TCO: Compared to the previous generation of MPP systems, DWAs are cheap. Their cost is a fraction of that of traditional MPP applications. This could enable an organization with a low budget and a great need to process large amounts of data to get in the game and truly show the power of a well-designed and well-run data warehouse.
There are two reasons for their low cost: a) a number of these appliances can run on commodity hardware, allowing the client to choose a preferred hardware supplier and operating system, and b) a number of the systems use open source database management systems, removing the need to pay hefty DBMS license fees. All in all, the total cost of ownership is much lower than that of the alternatives.
Scalability: For my money, this is the most important differentiator for DWA systems. An organization could start building at a small scale (to prove the worthiness of a data warehouse) with five to 10 nodes and add new nodes and extra storage as requirements and budgets grow.