So, just for starters and to be sure we’re all clear on the subject, just how big is a petabyte? Well, a terabyte is 10 to the 12th power bytes (a one followed by 12 zeroes) and a petabyte is 10 to the 15th power bytes (a one followed by 15 zeroes), or a thousand terabytes. In human terms it goes something like this: if every PC had a 50GB hard drive, storing a petabyte would take 20,000 PCs.
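That back-of-the-envelope arithmetic is easy to check. The snippet below (illustrative only, not part of the article's analysis) uses the decimal definitions of the units given above:

```python
# Decimal (SI) definitions of the storage units discussed above.
GIGABYTE = 10**9   # bytes
TERABYTE = 10**12  # bytes
PETABYTE = 10**15  # bytes

# A petabyte is a thousand terabytes.
assert PETABYTE == 1000 * TERABYTE

# If every PC had a 50 GB hard drive, how many PCs would a petabyte fill?
pcs_needed = PETABYTE // (50 * GIGABYTE)
print(pcs_needed)  # 20000
```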
Okay then, the question now becomes, What do I do with that much storage? Who needs it? Who wants it?
Well, quite a few people in fact. Very large data warehouses are already providing significant return on investment to the companies using them. In today’s world, competitive advantage comes not from differences in prices and products, but from having more detailed information about your customers and potential customers than the competition does.
Converting prospects into loyal customers means presenting them with just the right products, services, and information at just the right time. Companies can do this only if they have collected enough detailed information about each prospect to identify the important patterns and have the proper systems in place to put information together and act upon it in a timely manner.
The companies that do the best job will be the winners: “retail is detail,” as “they” say. Technology has given companies the power to collect detailed data in quantities (hundreds of terabytes already, and a petabyte isn’t far off) and deploy it in time frames (seconds) that once would have seemed possible only in science fiction. To search and deploy such huge volumes of data so quickly, scalability is crucial.
Scalability is the ability to add more processing power to a hardware configuration and have a linearly proportional increase in performance. Or, looked at another way, it is the ability to add hardware to store and process increasingly larger volumes of data (or increasingly complex queries or increasingly larger numbers of concurrent queries) without any degradation in performance. A poor design or product deployment does just the opposite: it causes performance to deteriorate faster than data size grows.
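The contrast between true linear scalability and a poorly scaling design can be made concrete with a small sketch. The functions below are illustrative, not drawn from any vendor's benchmark: the second models the familiar Amdahl's-law pattern, where any serial portion of the work caps the benefit of added processors.

```python
def linear_query_time(base_seconds, processors):
    """Ideal linear scalability: doubling the hardware halves the query time."""
    return base_seconds / processors

def sublinear_query_time(base_seconds, processors, serial_fraction=0.2):
    """Amdahl's-law-style scaling: the serial fraction of the work never
    speeds up, so performance gains flatten as processors are added."""
    return base_seconds * (serial_fraction + (1 - serial_fraction) / processors)

# A hypothetical 10-minute query on a single processor.
base = 600.0
for p in (1, 2, 4, 8):
    print(f"{p} CPUs: linear {linear_query_time(base, p):6.1f}s, "
          f"sub-linear {sublinear_query_time(base, p):6.1f}s")
```

With eight processors, the ideal design finishes in 75 seconds while the design with a 20% serial bottleneck still takes 180 — and the gap only widens as data volumes and processor counts grow.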
Scientific institutions, such as Lawrence Livermore National Laboratories, have been working with massive amounts of data (hundreds of terabytes) for hydrodynamic and particle-in-cell simulations for decades. They have often custom developed the programs, operating systems, and compilers to exploit scalable hardware for these purposes.
However, companies such as SBC and others have brought this capability into the mainstream with commercial systems capable of harnessing hundreds of top-of-the-line Intel CPUs with many hundreds of gigabytes of addressable memory and hundreds of terabytes of disk space all supporting a single, integrated database.
So, what’s involved in successfully designing and deploying such a system? True scalability has four dimensions:
Dimension One: Handling the Size
Every day businesses gather staggering amounts of data that can be used to support key business applications and enterprise decision-making. Meanwhile, the price per megabyte is falling. Yet the question remains: does the extra data add enough value to justify the expense of storing it?
It does if businesses can efficiently retrieve richly detailed answers to strategic and tactical business queries.
Assume, for example, that a multinational bank wants to score the lifetime value of the customers in one key segment. If the database still processes data serially, such a query can bog down the entire system. In contrast, a divide-and-conquer approach — parallel technology deployed on a “shared nothing” architecture, where each processing node owns its slice of the data — delivers answers to key business questions more quickly and more reliably. That’s where quantifiable business value begins.