Petabyte Power

Dimension Two: The Challenge of Concurrent Queries

Large corporations need to enable thousands of queries from anywhere within the organization at any time, covering both long- and short-range needs. The multinational bank in the example above might also need fraud detection for countless credit card transactions. At the same time, managers might want an analysis of monthly sales figures. Multiply all this by hundreds of business units across various geographic regions, and the need for concurrent query capabilities becomes quite clear.

Handling concurrent queries demands that a data warehouse possess sophisticated resource management capabilities. As queries come in, the parallel database must be able to satisfy multiple requests and scan multiple tables simultaneously.
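As a rough illustration of what resource management can mean in practice, the sketch below admits queries in waves, limited by a fixed number of execution slots, and lets short, urgent work such as fraud checks go ahead of long reports. The class name Query, the function admit_in_waves, the priority values, and the query names are all hypothetical; a real warehouse's workload manager is far more elaborate than this.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Query:
    priority: int                     # lower number is admitted first (assumed convention)
    arrival: int                      # tie-breaker: earlier arrival first
    name: str = field(compare=False)  # label only; not used for ordering


def admit_in_waves(queries, slots):
    """Group queries into waves of at most `slots` queries, highest priority first."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    heap = list(queries)
    heapq.heapify(heap)
    waves = []
    while heap:
        # Fill every free slot with the most urgent waiting queries.
        wave = [heapq.heappop(heap).name for _ in range(min(slots, len(heap)))]
        waves.append(wave)
    return waves


# Hypothetical workload: two long reports and two fraud checks.
workload = [
    Query(2, 0, "monthly_sales_report"),
    Query(1, 1, "fraud_check_a"),
    Query(1, 2, "fraud_check_b"),
    Query(2, 3, "regional_rollup"),
]
waves = admit_in_waves(workload, slots=2)
```

With two slots, both fraud checks are admitted in the first wave even though a sales report arrived before them. The point of the sketch is only that admission order is a policy decision the warehouse has to make, not an accident of arrival time.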

Dimension Three: Maintaining Business Relationships Among Complex Data

Handling increased data complexity is another challenge for optimizing queries in massive databases. For example, building a simple customer profile might once have involved three or four interrelated data points stored in disparate data marts.

Now it might involve thirty or forty data points, all housed in one enterprise data warehouse. If the warehouse can only create a gargantuan table with billions of pieces of generically categorized transaction data, all the processing capacity in the world isn’t going to deliver a useful customer profile. Even if the warehouse can separate the data into different tables, if it can’t preserve the business relationships among the tables, then the ability to analyze that data and, in turn, its business value are compromised.

Therefore, as warehouses increase their capacity, they also must create a super-efficient “file system” specifically for analytic queries. The system should contain multiple tables, but preserve the business relationships across subject areas for easy cross-referencing and extensibility. For the customer profile example, the deeply detailed information contained in those tables can now deliver unique insights for product development, marketing programs, or a number of other critical business challenges.
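To make the idea of preserved business relationships concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The three tables (customer, account, txn), their columns, and the sample rows are invented for illustration; they are assumptions, not a schema described in this article. Foreign keys record how the tables relate, so a profile can be assembled across them with a join.

```python
import sqlite3


def build_warehouse():
    """Create a tiny in-memory schema whose tables are linked by foreign keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE account (
            account_id  INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            product     TEXT NOT NULL
        );
        CREATE TABLE txn (
            txn_id      INTEGER PRIMARY KEY,
            account_id  INTEGER NOT NULL REFERENCES account(account_id),
            amount      REAL NOT NULL
        );
    """)
    # Hypothetical sample data.
    conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
    conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                     [(10, 1, "checking"), (11, 1, "credit_card")])
    conn.executemany("INSERT INTO txn VALUES (?, ?, ?)",
                     [(100, 10, 250.0), (101, 11, 40.0), (102, 11, 60.0)])
    conn.commit()
    return conn


def customer_profile(conn, customer_id):
    """Follow the customer -> account -> txn relationships to build one profile."""
    row = conn.execute("""
        SELECT c.name,
               COUNT(DISTINCT a.account_id),
               COUNT(t.txn_id),
               COALESCE(SUM(t.amount), 0)
        FROM customer c
        LEFT JOIN account a ON a.customer_id = c.customer_id
        LEFT JOIN txn t     ON t.account_id  = a.account_id
        WHERE c.customer_id = ?
        GROUP BY c.customer_id
    """, (customer_id,)).fetchone()
    if row is None:
        return None
    return {"name": row[0], "accounts": row[1], "transactions": row[2], "total": row[3]}


conn = build_warehouse()
profile = customer_profile(conn, 1)
```

Because the relationships are declared, the database also refuses a transaction that points at a nonexistent account, which is one small way the business meaning of the data is protected as the schema grows.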

Dimension Four: Support for Sophisticated Data Queries and Data Mining

Finally, the super data warehouse must be prepared to handle queries and data mining that ask for more than a tally of last month’s shoe sales. For example, scoring the lifetime value of a customer is really a question with many component parts. The warehouse must be able to break down the various components and determine an efficient route for gathering the appropriate information.

A cost-based optimizer is supposed to automate this process in most databases, but too often database administrators end up having to intervene, which is both costly and time-consuming. A data warehouse that truly delivers petabyte-type value would have an optimizer that handles sophisticated queries and data mining without human intervention.
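For readers unfamiliar with the term, the toy sketch below shows the basic idea behind cost-based optimization: estimate the cost of each candidate plan and pick the cheapest. Here the only decision is the order in which three tables are joined, and cost is taken to be the total number of intermediate rows produced. The table names, row counts, and selectivities are made-up assumptions, and production optimizers rely on much richer statistics and search strategies than brute-force enumeration.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Estimated cost of a left-deep join order: the sum of intermediate result sizes."""
    joined = [order[0]]
    size = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        size *= rows[table]
        # Apply every join predicate linking the new table to tables already joined;
        # a missing predicate means a cross product (selectivity 1.0).
        for prev in joined:
            size *= selectivity.get(frozenset((prev, table)), 1.0)
        joined.append(table)
        cost += size
    return cost


def best_plan(rows, selectivity):
    """Enumerate every join order and return the one with the lowest estimated cost."""
    return min(permutations(rows), key=lambda order: plan_cost(order, rows, selectivity))


# Hypothetical statistics for three tables.
rows = {"customer": 1_000, "account": 5_000, "txn": 1_000_000}
selectivity = {
    frozenset(("customer", "account")): 1 / 1_000,  # each account has one customer
    frozenset(("account", "txn")): 1 / 5_000,       # each transaction has one account
}
plan = best_plan(rows, selectivity)
```

Under these assumed numbers, the cheapest plans join the two small tables first and bring in the large transaction table last, while an order that pairs customer directly with txn pays for an enormous cross product. Choosing well among such alternatives, automatically and for far larger queries, is the work the article expects the optimizer to do without an administrator stepping in.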

In the world of data, value derives from increasingly detailed and timely business intelligence that informs decision-making across an enterprise. Unless the data warehouse can efficiently organize increasingly complex data and optimize sophisticated and concurrent queries, the amount of data stored is meaningless.

What’s exciting about the petabyte is that the capabilities to do something with that data are on the verge of becoming a reality. That’s a development worth heralding.