Multicore architecture roadmap

Meeting minutes from the roadmap discussions in Barcelona on June 3, 2008:

o Problem of reaching consensus on how to characterize, present, and measure performance.

o The same applies to benchmarks.

o Multicore workshop!!

o An important point: architectural support for programming models, and how to bring programming-model work and multicore architecture work together

§ Transactional memory?

§ Other?

o Request: each member should describe their research area and interests in the wiki.

· ROADMAP

The following challenges were identified. The number in parentheses indicates how many of the meeting attendees are working on each challenge.

o Big problems (number of attendees working on each)

§ Off-chip memory BW (6)

§ Programming (8)

§ Power (6)

§ Test & verification (0)

§ Reliability (6)

§ Application-specific solutions

§ Scalable one-size-fits-all architecture: scalability (15)

§ End-of-CMOS (0)

§ Resource management (15)

§ Heterogeneity (5)

- Heterogeneous multicores coupled with efficient, scalable programming-model tools

- Or homogeneous multicores with some kind of heterogeneous programming model.

§ Debugging and performance monitoring (1)

§ Virtualization (3)

§ Design Space Exploration (4)


Roadmap draft

Draft of the Roadmap for the Multicore Architecture Cluster
During the HiPEAC Computing Week (June 2 to June 6, 2008), a meeting attended by about 50 researchers was held to discuss the main challenges facing the multi-core architecture area. Based on the questionnaire assembled by Dr. Marc Duranton, HiPEAC's roadmap coordinator, the meeting distilled a number of challenges for the area, which are briefly discussed below.

In the terminology that has prevailed for many decades, a multi-core architecture is a MIMD (multiple-instruction, multiple-data) multiprocessor. Over the last decade, chip multiprocessing (mostly heterogeneous, with up to 6-8 cores) has been common in embedded SoCs, anticipating some of the trends since adopted by mainstream general-purpose processors. However, the ad-hoc programmability of such embedded systems has been far from satisfactory, and we now have enough transistors to integrate even more complex cores on a single chip. If we envision a multi-core microprocessor with 256 cores by 2015, several opportunities and system-level challenges arise at the architecture level. Multi-core challenges exist at both the hardware and the software level; only the hardware challenges are discussed here. Each challenge begins with a list of keywords collected in an informal poll during the meeting. We also informally asked how many attendees are working on each issue; that count is given in parentheses after each keyword.

Challenge 0: Scalability, Power, Reliability, and Verification

Scalable one-size-fits-all architecture (15)
Reliability (6)
Power (6)

Multi-core architectures promise to deliver scalable performance by increasing the number of cores with each new technology generation. From the architecture perspective, the major challenge for the long-term roadmap will be to deliver computational performance that scales linearly with the number of cores within the constraints of the underlying technology (CMOS for another 10-15 years). Scalability will therefore be the major concern and will drive architecture research in all subsystems, including the core architecture and the memory subsystem. Another major challenge is to offer scalable performance within an acceptable power budget, which will have a dramatic impact on the design decisions made to achieve scalability. Reliability issues, such as soft errors and process variability, will also play an increasingly important role: these process imperfections will constrain viable design decisions more and more as we move along the technology roadmap. At some not-so-distant point, we will be forced to consider how to extend the technology roadmap beyond CMOS, and hybrid technologies will have to be considered; this too will have a major impact on architecture research. A further cross-cutting issue is complexity. The multicore paradigm shift did not happen only because of the difficulty of scaling clock frequency and exploiting more instruction-level parallelism; the complexity of superscalar cores had also made verification a significant effort. At the beginning of the multi-core era, one could move from one technology generation to the next simply by doubling the number of cores. However, as multicore architectures scale up, complexity issues will become more prevalent, so scalability must also account for verification.
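Why linear scaling with core count is so hard to achieve can be made concrete with Amdahl's law, a standard model that was not itself part of the meeting discussion: if a fraction p of the work parallelizes perfectly over n cores, the speedup is S(n) = 1 / ((1 - p) + p / n). The minimal Python sketch below evaluates it for the 256-core design point mentioned earlier; the function name and the parallel fractions are illustrative assumptions, not measured data.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup over one core when `parallel_fraction` of the work
    parallelizes perfectly and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


# Illustrative (assumed) parallel fractions at the 256-core design point.
for p in (1.0, 0.999, 0.99, 0.95):
    s = amdahl_speedup(p, 256)
    print(f"p = {p:<5}  speedup = {s:6.1f}  efficiency = {s / 256:5.1%}")
```

Even with 99% of the work parallel, 256 cores yield only about a 72x speedup, so linear scaling requires attacking every remaining serial or contended part of the system.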

Challenge 1: The Parallel Programming Bottleneck

Programming (8)
The most pressing issue we face in the multi-core era is how to solve the enormous parallel programming dilemma, and architecture research can be of major help here. As we move along the multi-core roadmap, traditional software-based synchronization methods will at some point no longer be feasible, and new (hardware-based) methods will have to be introduced. Transactional memory is one candidate, but it is probably only a first approach. Meanwhile, the hardware/software interface, i.e., the instruction-set architecture, has remained more or less unchanged for several decades. An important challenge is to understand which hardware/software abstractions can enhance the productivity of parallel software development, and then to find suitable ways to implement them. The abundance of transistors available in the next decade can be put to good use in realizing such enhanced abstractions for programmers.
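To illustrate the kind of abstraction transactional memory offers programmers, the Python sketch below models a single shared value updated optimistically: each update runs speculatively on a snapshot and commits only if no other thread committed in the meantime, otherwise it retries. This is a toy software analogue under stated assumptions: a version counter stands in for the conflict detection a hardware TM would perform in the cache, and a short lock makes the snapshot and commit steps atomic. The class `VersionedCell` and its methods are hypothetical.

```python
import threading
from typing import Callable


class VersionedCell:
    """Toy model of transaction-style (optimistic) updates to one shared value.

    Assumption: a version counter stands in for the read/write-set tracking a
    hardware TM would do; a short lock makes snapshot and commit atomic.
    """

    def __init__(self, value: int = 0) -> None:
        self.value = value
        self.version = 0
        self.aborts = 0  # number of commits rejected because of a conflict
        self._lock = threading.Lock()

    def atomically(self, update: Callable[[int], int]) -> int:
        """Apply update(old) -> new as one atomic step, retrying on conflict."""
        while True:
            # Begin: take a consistent snapshot of the value and its version.
            with self._lock:
                seen_version, seen_value = self.version, self.value
            # Speculative work runs without holding the lock.
            new_value = update(seen_value)
            # Commit: succeeds only if no other thread committed since the snapshot.
            with self._lock:
                if self.version == seen_version:
                    self.value = new_value
                    self.version += 1
                    return new_value
                self.aborts += 1  # conflict: discard the result and retry


# Usage: four threads each perform 1000 increments without lost updates.
cell = VersionedCell()


def worker() -> None:
    for _ in range(1000):
        cell.atomically(lambda v: v + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.value, cell.version, cell.aborts)
```

The programmer only states what must happen atomically; detecting and resolving conflicts is left to the runtime here, or to hardware in a TM design. The abort count depends on thread scheduling and varies from run to run.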

Challenge 2: The Memory System Bottleneck
Off-chip memory BW (6)

Off-chip memory bandwidth: The critical infrastructure needed to host a large core count (say, 100-1000 cores ten years from now) consists of the on-chip memory subsystem and network-on-chip (NoC) technologies. Scaling these subsystems resource-efficiently to accommodate the foreseen core count is a major challenge. According to the ITRS, off-chip bandwidth is expected to grow only linearly rather than exponentially, so high on-chip cache performance is crucial to reduce off-chip traffic. However, we have seen diminishing returns from the chip area devoted to caches, so cache hierarchies clearly need innovation to make better use of these resources.
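The link between cache performance and off-chip bandwidth can be made explicit with first-order arithmetic: off-chip traffic is roughly the number of cores times last-level-cache (LLC) accesses per core per second times the LLC miss rate times the cache line size. The Python sketch below applies this model, which ignores write-backs, prefetching, and coherence traffic; the function names and all parameter values are hypothetical and come neither from the ITRS nor from the meeting.

```python
def offchip_traffic_gbs(cores: int, llc_accesses_per_core_per_s: float,
                        llc_miss_rate: float, line_bytes: int = 64) -> float:
    """First-order off-chip traffic (GB/s) caused by last-level-cache misses.
    Ignores write-backs, prefetching and coherence traffic."""
    return cores * llc_accesses_per_core_per_s * llc_miss_rate * line_bytes / 1e9


def max_llc_miss_rate(budget_gbs: float, cores: int,
                      llc_accesses_per_core_per_s: float, line_bytes: int = 64) -> float:
    """Highest LLC miss rate that keeps off-chip traffic within a fixed budget."""
    return budget_gbs * 1e9 / (cores * llc_accesses_per_core_per_s * line_bytes)


# Hypothetical design point: 256 cores, 1e8 LLC accesses per core per second,
# 1% LLC miss rate, 64-byte lines.
budget = offchip_traffic_gbs(256, 1e8, 0.01)
# If off-chip bandwidth stays fixed while the core count doubles, the miss rate must halve.
print(budget, max_llc_miss_rate(budget, 512, 1e8))
```

Because the tolerable miss rate shrinks in proportion to the core count when bandwidth does not keep up, the cache hierarchy has to absorb an ever larger share of the traffic.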

Scalable cache coherence: At the core counts foreseeable within the next decade, it seems reasonable to support a shared memory model. A shared memory model, however, requires efficient support for cache coherence. Scalable cache coherence protocols received a great deal of attention in the late 1980s and early 1990s, which enabled commercial shared-memory multiprocessors with several hundred processors, e.g., the SGI Origin 2000. More recently, the latency/bandwidth trade-off between broadcast-based (snooping) and point-to-point (directory-based) cache coherence protocols has been studied in detail. However, now that we will soon be able to host hundreds of cores on a chip, the technological parameters and constraints will be quite different. For example, cache-to-cache miss latencies are relatively shorter, and on-chip bandwidth is much larger, than in the multi-chip systems of the 1990s. On the other hand, design decisions are severely constrained by power consumption. All these differences make it important to revisit the design of scalable cache coherence protocols for multi-cores in this new context.
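The difference between the two protocol families can be seen by counting coherence messages. The Python sketch below is a deliberately minimal directory with MSI states and a sharer set per cache line, compared against a broadcast (snooping) baseline in which a write request is observed by every other core. It models only invalidations and forwarding to the owner; transient states, acknowledgements, data messages, and races are omitted, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    state: str = "I"                           # "I" invalid, "S" shared, "M" modified
    sharers: set = field(default_factory=set)  # cores holding a copy (the owner if "M")


class Directory:
    """Minimal MSI directory: counts point-to-point coherence messages only."""

    def __init__(self, num_cores: int) -> None:
        self.num_cores = num_cores
        self.entries: dict = {}
        self.messages = 0

    def _entry(self, addr: int) -> DirEntry:
        return self.entries.setdefault(addr, DirEntry())

    def read(self, core: int, addr: int) -> None:
        e = self._entry(addr)
        if e.state == "M" and core in e.sharers:
            return                    # read hit by the owner: no coherence action
        if e.state == "M":
            self.messages += 1        # forward to the owner, which downgrades to S
        e.state = "S"
        e.sharers.add(core)

    def write(self, core: int, addr: int) -> None:
        e = self._entry(addr)
        others = e.sharers - {core}
        self.messages += len(others)  # invalidate only the cores that hold a copy
        e.state = "M"
        e.sharers = {core}


def broadcast_messages(num_cores: int, write_requests: int) -> int:
    """Snooping baseline: each write request is observed by every other core."""
    return write_requests * (num_cores - 1)


# On a 256-core chip, four cores share a line, then core 0 writes it.
d = Directory(num_cores=256)
for c in range(4):
    d.read(c, 0x40)
d.write(0, 0x40)
print(d.messages, broadcast_messages(256, 1))
```

Here the directory sends 3 invalidations where a broadcast reaches 255 cores; in exchange it adds an indirection through the directory and per-line sharer storage. How that latency/bandwidth/power balance shifts on chip is exactly the question raised above.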

Challenge 3: Heterogeneous Multicore Architectures
Heterogeneity (5)
Application-specific solutions

Future multicore architectures will exhibit heterogeneous computing cores for many reasons. First of all, homogeneous multi-core architectures will most likely deliver scalable performance only for a rather limited domain of embarrassingly parallel problems. Hardware accelerators for important classes of computational problems will therefore play an important role in achieving scalable performance within an acceptable power budget. For such increasingly massive heterogeneous multicore architectures, a major challenge is to offer a scalable programming model, and on the architecture side, to provide the hardware support needed to realize it. Heterogeneous cores also bring their own design complexity: special-purpose cores have a significant impact on the system's memory hierarchy and require specially designed communication protocols for fast data exchange among them. In particular, a major challenge is the design of a high-performance, flexible communication interface between less traditional computing cores (e.g., FPGAs) and the rest of the multi-core system.
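Why the communication interface matters can be shown with a simple break-even check: offloading a kernel to an accelerator pays off only if setup, data transfer, and accelerator execution together take less time than running the kernel on a general-purpose core. The Python sketch below assumes transfers are not overlapped with computation; the function names and all timings and link speeds are hypothetical.

```python
def offload_time_s(accel_compute_s: float, bytes_moved: float,
                   link_gbs: float, setup_s: float = 0.0) -> float:
    """Total time to run a kernel on an accelerator, including data movement.
    Assumes transfers are not overlapped with computation."""
    transfer_s = bytes_moved / (link_gbs * 1e9)
    return setup_s + transfer_s + accel_compute_s


def offload_pays_off(cpu_compute_s: float, accel_compute_s: float,
                     bytes_moved: float, link_gbs: float, setup_s: float = 0.0) -> bool:
    """True if offloading beats running the kernel on a general-purpose core."""
    return offload_time_s(accel_compute_s, bytes_moved, link_gbs, setup_s) < cpu_compute_s


# Hypothetical kernel: 10 ms on a core, 1 ms on the accelerator, 64 MB moved.
for link in (1.0, 16.0):  # link bandwidth in GB/s
    print(link, offload_time_s(0.001, 64e6, link), offload_pays_off(0.010, 0.001, 64e6, link))
```

An accelerator that computes 10x faster still loses to the general-purpose core over the slow link and wins over the fast one, which is why a high-performance coupling between accelerators such as FPGAs and the rest of the chip is central to heterogeneous designs.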

Challenge 4: Debugging, Resource Management, Performance Monitoring, and Virtualization

Hardware support for debugging: Debugging a multi-core, multi-ISA application is a complex task. The debugger must be powerful and yet cause very low overhead, to avoid timing violations and so-called Heisenbugs. This is currently a big problem for existing debuggers, since providing a global view of a multi-core machine is virtually impossible without specialized hardware support. Much more than classic single-core devices, multi-core chips have to be designed to support debugging tools. Proper hardware support is needed to observe an execution non-intrusively, to produce synchronized execution traces from multiple cores, and to move debug data into and out of the chip.
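As an illustration of what synchronized traces enable, the Python sketch below merges per-core trace buffers into a single global timeline. It assumes the hardware stamps every event with a chip-wide synchronized timestamp, which is the kind of support called for above; the `TraceEvent` record, the `merge_traces` helper, and the example events are hypothetical.

```python
import heapq
from typing import List, NamedTuple


class TraceEvent(NamedTuple):
    timestamp: int  # assumed chip-wide synchronized cycle counter
    core: int
    event: str


def merge_traces(per_core: List[List[TraceEvent]]) -> List[TraceEvent]:
    """Merge per-core buffers (each already in timestamp order) into one timeline.
    Ties are broken by core id so the result is deterministic."""
    return list(heapq.merge(*per_core, key=lambda e: (e.timestamp, e.core)))


core0 = [TraceEvent(10, 0, "lock acquire"), TraceEvent(40, 0, "lock release")]
core1 = [TraceEvent(25, 1, "lock request"), TraceEvent(45, 1, "lock acquire")]
for ev in merge_traces([core0, core1]):
    print(ev.timestamp, ev.core, ev.event)
```

Without a common time base, the per-core buffers could only be interleaved by guesswork, so a trustworthy global view depends on hardware support rather than on the debugger alone.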

Resource management.

Performance monitoring.

Virtualization.