Architecting for the Era of Tera

With television, the Internet, phone calls, and print media, our world is flooded with data. The quantity of data doubles every 24 months — the data equivalent of Moore’s Law.

The amount of worldwide data has grown by more than 30% per year for the past several years, so much so that it is now measured in exabytes: an exabyte is 10^18 bytes, or a billion gigabytes.

As a result, we are faced with a new challenge: what should we do with all of the data?

By itself, data is unusable. To distill data into meaningful information requires finding and manipulating patterns and groupings within the data. And powerful computer systems are required to find these patterns in an ever-increasing pool.

To ensure that today’s computers are able to handle future applications, they will need to increase their processing capabilities at a rate faster than the growth of data.

Defining Tera-Era Workloads

To develop processor architectures capable of delivering tera-level computing, Intel classifies these processing capabilities, or workloads, into three fundamental types: recognition, mining, and synthesis, or RMS.

The RMS model is a useful framework for matching processor capabilities with a specific class of applications.

Recognition is the matching of patterns and models of interest to specific application requirements.

Large data sets have thousands, even millions, of patterns, many of which are not relevant to a specific application. To extract significant patterns from the data, a rapid, intelligent pattern recognizer is essential.

Mining, the second processing capability, is the use of intelligent methods to distill useful information and relationships from large amounts of data.

This is most relevant when predicting behaviors based upon a collection of well-defined data models. Recognition and mining are closely dependent on and complementary to each other.

Synthesis is the creation of large data sets or virtual worlds based upon the patterns or models of interest.

It also refers to the creation of a summary or conclusion about the analyzed data. Synthesis is often performed in conjunction with recognition and mining.
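As a rough illustration of how the three workload types fit together, the toy pipeline below recognizes the readings that match a pattern of interest, mines a simple statistic from the matches, and synthesizes a one-line conclusion. The data and threshold are invented for the example; real RMS workloads operate on data sets many orders of magnitude larger.

    // A toy recognition-mining-synthesis pipeline (illustrative data and threshold).
    #include <algorithm>
    #include <cstdio>
    #include <iterator>
    #include <vector>

    int main() {
        const std::vector<double> readings = {0.1, 0.9, 0.85, 0.2, 0.95, 0.3, 0.88};
        const double pattern_threshold = 0.8;   // the assumed "pattern of interest"

        // Recognition: find the readings that match the pattern of interest.
        std::vector<double> matches;
        std::copy_if(readings.begin(), readings.end(), std::back_inserter(matches),
                     [&](double r) { return r >= pattern_threshold; });

        // Mining: distill a relationship from the matched data (here, its mean).
        double mean = 0.0;
        for (double m : matches) mean += m;
        if (!matches.empty()) mean /= static_cast<double>(matches.size());

        // Synthesis: produce a conclusion about the analyzed data.
        std::printf("%zu of %zu readings matched; mean matched value %.2f\n",
                    matches.size(), readings.size(), mean);
        return 0;
    }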

The RMS model requires enormous algorithmic processing power as well as high I/O bandwidth to move massive quantities of data. Processor architects use different approaches to maximize performance for each workload in the RMS model, balancing and trading off combinations of factors including the number of transistors on the die, power requirements, and heat dissipation.

These choices result in architectures optimized for specific classes of workload.

The RMS workloads in tera-level computing require several similar, application-independent capabilities:

  • Teraflops of processing power.
  • High I/O bandwidth.
  • Efficient execution and/or adaptation to a specific type of workload.

With tera-levels of performance, it becomes possible to bring these workloads together on one architectural platform, using common kernels. Tera-level computing platforms will use a single optimal architecture for all RMS workloads.

Enabling the Era of Tera

The power of the computing architecture required for tera-level applications is 10-to-100 times the capabilities of today’s platforms. The figure below illustrates that while the rate of frequency scaling is slowing, other techniques are actually increasing the rate of overall performance improvement.

Moving forward, we see that performance will be derived from new architectural capabilities such as multi- and many-core architectures as well as frequency scaling.

We can expect performance to improve at a faster rate than we have seen historically with frequency scaling alone.

Recognizing the need to increase today’s platform capabilities, Intel is developing a billion-transistor processor. Yet processor improvements in clock speed and transistor count alone will not meet the requirements of tera-level computing in the next 25 years.

Challenges

A number of the less-friendly laws of physics are more limiting than Moore’s Law. As clock frequencies increase and transistor size decreases, obstacles are developing in key areas:

  • Power: Power density is increasing so quickly that tens of thousands of watts per square centimeter (W/cm²) would be needed to scale the performance of the Pentium processor architecture over the next several years. This creates an obvious problem: such a power density exceeds that of the surface of the sun.
  • Memory Latency: Memory speeds have not increased as quickly as logic speeds. Memory access with i486 processors required 6-to-8 clocks; today’s Pentium processors require 224 clocks, about a 28x increase. These wasted clock cycles can negate the benefits of processor frequency increases, as the arithmetic sketch after this list illustrates.

  • RC Delay: Resistance-capacitance (RC) delays on chips have become increasingly challenging. As feature size decreases, the delay due to RC increases.

    At the 65nm (nanometer) node and below, the delay a signal incurs over a one-millimeter interconnect is actually greater than a clock cycle.

    Intel chips are typically in the 10-to-12 millimeter range, so some signals require 15 clock cycles to travel from one corner of the die to the other, again negating many of the benefits of frequency gains.
  • Scalar Performance: Experiments with frequency increases of various architectures such as superscalar, CISC (complex instruction set computing), and RISC (reduced instruction set computing) are not encouraging.

    As frequency increases, instructions per clock actually trend down, illustrating the limitations of concurrency at the instruction level.
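To make the memory-latency point concrete, here is a back-of-the-envelope sketch using the 224-clock penalty quoted above together with an assumed 2% miss rate per instruction and an assumed 3 GHz clock. The assumed numbers are illustrative, not measurements, but they show how stalls can reduce a fast core to a fraction of its nominal throughput.

    // Back-of-the-envelope effective-throughput calculation (illustrative numbers).
    #include <cstdio>

    int main() {
        const double base_cpi     = 1.0;    // assumed cycles per instruction without misses
        const double miss_rate    = 0.02;   // assumed cache misses per instruction
        const double miss_penalty = 224.0;  // clocks per miss, from the text
        const double clock_ghz    = 3.0;    // assumed core frequency

        const double effective_cpi = base_cpi + miss_rate * miss_penalty;  // about 5.5
        const double useful_ghz    = clock_ghz * base_cpi / effective_cpi; // about 0.55

        std::printf("effective CPI %.2f -> roughly %.2f GHz of useful throughput\n",
                    effective_cpi, useful_ghz);
        return 0;
    }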

Performance improvements must come primarily from architectural innovations, as monolithic architectures have reached their practical limits.

The New Architectural Paradigm

In the past, mini- and mainframe computers provided many of the architectural ideas used in personal computers today. Now, we are examining other architectures for ways to meet tera-level challenges.

High-performance computers (HPC) deliver teraflop performance at great cost and for very limited niche markets. The industry challenge is to make this level of processing available on platforms as accessible as today’s PC.

Concurrency

The key concept from high-performance computing is to use multiple levels of concurrency and execution units. Instead of a single execution unit, a multi-core architecture with four, eight, 64, or in some cases hundreds of execution units is the only way to achieve tera-level computing capabilities.

Multi-core architectures localize the implementation within each core and establish relationships with the “Nth” level of cache (the second and third levels). This creates enormous challenges in platform design.

Multiple cores and multiple levels of cache scale processor performance exponentially, but memory latency, RC interconnect delay, and power issues still remain — so platform-level innovations are needed.

This architecture will include changes from the circuit through the microprocessor(s), platform, and entire software stack.

The SPECint experiments show that microprocessor-level concurrency alone is not sufficient. Delivering teraflop performance requires a massively multi-core architecture with multiple threads of execution on each core, minimal memory latency, low RC interconnect delay, and controlled thermal behavior.

The three attributes that will define this new architecture are scalability, adaptability, and programmability.

Scalability

Scalability is the ability to exploit multiple levels of concurrency based on the resources available and to increase platform performance to meet increasing demands of the RMS workloads.

There are two ways to scale performance. Historically, the industry has “scaled up” by increasing the capabilities and speed of single processing cores. An example of “scaling up” can be found in helper thread technology.

Helper threads implement a form of user-level switch-on-event multithreading on a conventional processor without requiring explicit OS or hardware support.

Helper threads improve single thread performance by performing judicious data prefetching when the main thread waits for service of a cache miss.
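The sketch below gives a rough feel for this pattern in ordinary C++: a second thread runs ahead of the main computation and issues software prefetches for the data the main thread is about to touch, so that more of the main thread’s accesses hit in cache. The access pattern, the free-running helper, and the GCC/Clang __builtin_prefetch intrinsic are assumptions made for illustration; Intel’s helper threads are generated and throttled automatically, as described above.

    // A minimal sketch of the helper-thread idea, assuming a GCC/Clang toolchain
    // (for the __builtin_prefetch intrinsic). The helper free-runs ahead of the
    // main thread; a real implementation would throttle it with switch-on-event
    // multithreading so it stays a fixed distance ahead.
    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;        // large enough to overflow the caches
        std::vector<int> data(n, 1);
        std::vector<std::size_t> index(n);
        // An irregular access pattern that defeats the hardware prefetcher.
        for (std::size_t i = 0; i < n; ++i) index[i] = (i * 2654435761u) % n;

        long long sum = 0;

        // Helper thread: issue prefetches for the lines the main thread will touch.
        std::thread helper([&] {
            for (std::size_t i = 0; i < n; ++i)
                __builtin_prefetch(&data[index[i]], 0 /*read*/, 1 /*low locality*/);
        });

        // Main thread: the real work, which now finds more of its data in cache.
        for (std::size_t i = 0; i < n; ++i)
            sum += data[index[i]];

        helper.join();
        return static_cast<int>(sum & 0xff);
    }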

Another method of scaling performance is “scaling out”: adding multiple cores and threads of execution to increase performance. The best-known examples of “scaling out” architectures are today’s high-performance computers, which have hundreds, if not thousands, of cores.
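A minimal sketch of scaling out in software, assuming a simple reduction as the workload: the data is split across however many hardware threads the platform reports, each worker produces a partial result over its own slice, and the partial results are combined at the end. The workload and the even partitioning are hypothetical; the point is that throughput comes from adding workers rather than from a faster single core.

    // Scaling out a reduction across all available hardware threads (sketch).
    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 26;
        std::vector<float> data(n, 0.5f);

        unsigned workers = std::thread::hardware_concurrency();
        if (workers == 0) workers = 4;                    // fallback if unknown

        std::vector<double> partial(workers, 0.0);
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w) {
            pool.emplace_back([&, w] {
                // Each worker reduces its own contiguous slice of the data.
                const std::size_t chunk = n / workers;
                const std::size_t begin = static_cast<std::size_t>(w) * chunk;
                const std::size_t end   = (w + 1 == workers) ? n : begin + chunk;
                partial[w] = std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                                             data.begin() + static_cast<std::ptrdiff_t>(end),
                                             0.0);
            });
        }
        for (auto& t : pool) t.join();

        const double total = std::accumulate(partial.begin(), partial.end(), 0.0);
        return total > 0.0 ? 0 : 1;
    }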

In today’s platforms, processors are often idle. For server workloads, processors can spend almost half of their total execution time waiting for memory accesses.

Therefore, the challenge and opportunity is to use this waiting time effectively. Experiments in Intel’s labs showed that helper threads can eliminate up to 30% of cache misses and improve performance of memory-intensive workloads on the order of 10%-to-15%.

Adaptability

Adaptability is also an attribute of this new architectural paradigm. An adaptable platform proactively adjusts to workload and application requirements.

The platform must be adaptable to any type of RMS workload. Multi-core architectures not only provide scalability but also the foundation for adaptability.

The following adaptability example uses special purpose processing cores called processing elements to adapt to 802.11a/b/g, Bluetooth, and GPRS.

Each processing element in the graphic is considered to be a processing core. Each can be assigned a specific radio algorithm function or resource, such as programmable logic array (PLA) circuits, Viterbi decoders, memory space, and other appropriate functions.

Each processing element may be a digital signal processor (DSP) or an application-specific integrated circuit (ASIC). The platform can be dynamically configured to operate for a workload like 802.11b by meshing a few processing elements.

In another configuration, the platform can be reconfigured to support GPRS or 802.11g or Bluetooth by interconnecting different sets of processing elements.

This type of architecture can support multiple workloads like 802.11a, GPRS, and Bluetooth simultaneously. This is the power of the multi-core microarchitecture.
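The sketch below shows one hypothetical way such reconfiguration might look from the software side: a pool of generic processing elements is meshed for whichever protocol is requested by binding each element to one of the functions that protocol needs. The enum values, the required-function table, and the configure() helper are invented for illustration and are not part of any Intel interface.

    // Hypothetical software view of re-meshing processing elements per workload.
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    enum class Function { FilterPLA, ViterbiDecoder, Fft, Mac, Memory };

    struct ProcessingElement {
        int id;
        Function assigned;
    };

    // Bind one processing element to each function the requested protocol needs.
    std::vector<ProcessingElement> configure(const std::string& protocol) {
        static const std::map<std::string, std::vector<Function>> required = {
            {"802.11b",   {Function::FilterPLA, Function::Mac, Function::Memory}},
            {"802.11g",   {Function::Fft, Function::ViterbiDecoder, Function::Mac, Function::Memory}},
            {"GPRS",      {Function::ViterbiDecoder, Function::Mac, Function::Memory}},
            {"Bluetooth", {Function::FilterPLA, Function::Memory}},
        };
        std::vector<ProcessingElement> mesh;
        int next_id = 0;
        for (Function f : required.at(protocol))
            mesh.push_back({next_id++, f});
        return mesh;
    }

    int main() {
        // The same pool of elements is re-meshed for different workloads at run time.
        for (const std::string proto : {"802.11b", "GPRS", "Bluetooth"})
            std::cout << proto << " uses " << configure(proto).size()
                      << " processing elements\n";
        return 0;
    }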

Programmability

The challenge of bringing high-performance computing to the desktop has been defining parallelizable applications and creating software development environments that understand the underlying architecture.

A programmable system will communicate workload characteristics to the hardware while architectural characteristics will be communicated back up to the applications.

Intel has started down this path with compilers such as those developed for Itanium processors. Much more must be done to take advantage of the new architectural features in these computing platforms.
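As a small example of what such an environment already looks like, the sketch below uses OpenMP (not discussed in the article, but a widely supported standard, including in Intel’s compilers) to mark a loop as parallelizable and leave its mapping onto cores to the compiler and runtime. The loop body is hypothetical.

    // A parallelizable loop whose mapping onto cores is left to the tools (OpenMP).
    #include <cstddef>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 22;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

        // The programmer only asserts that iterations are independent; the compiler
        // and runtime decide how many threads to create and how to schedule them.
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(n); ++i)
            c[i] = a[i] + b[i];

        return c[0] == 3.0f ? 0 : 1;
    }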

We are on the cusp of another leap in computing capabilities that will dramatically impact virtually everything in our lives. With the immense amount of data generated by corporate networks, it is necessary to scale computing to match this ever-increasing volume.

The solution to the challenge of tera will herald changes perhaps as dramatic as those brought about by the printing press, the automobile, and the Internet.

R.M. Ramanathan has been a technology evangelist and a Marketing Manager at Intel. In his 10 years with Intel he has held various positions, from engineering to marketing and management. Before coming to Intel, Ramanathan was director of engineering for a multinational company in India.

Francis Bruening has been with Intel for eight years and holds a bachelor’s degree in computer science from Cleveland State University. He has been a software developer and manager, and is currently a technology marketing manager, promoting and developing the ecosystems necessary for industry adoption of new technologies.