Courses
The summer school consists of 12 courses spread over two morning slots and two afternoon slots. Per slot there are three parallel courses of which you can take only one. The number of students that can enroll in one course is limited to 50. When applying for admission, you will be asked to indicate your preference.
The courses have been allocated to slots so that you can always put together a summer school program that matches your research interests.

High Performance Embedded Computing, Josh Fisher
Historically, high performance embedded computing has been accomplished via normal (read, "slow") processors and attached specialized devices (typically ASICs). But silicon densities and speeds are making the inevitable happen in embedded computing: processors are assuming more and more of the high-performance load, taking that load away from specialized devices. Techniques like Instruction-level Parallelism that seemed too expensive for embedded designs have become feasible and popular. This change is bringing in a new age of embedded computing design, in which a high-performance processor is central. Increasingly, the traditional elements of nonprogrammable components, peripherals, interconnects and buses must be seen in a computing-centric light. Embedded computing designers must design systems that unify these elements with high-performance processor architectures, microarchitectures and compilers, and with the compilation tools, debuggers and simulators needed for application development.
This course is about this new world of embedded computing, with the strongest emphasis on the processing aspects. It would be easy to think that the topics in such a course would be a retreading of general-purpose, high-performance processing, and in places that is true. But more often, subtle differences between the general-purpose world and the embedded world change the equation considerably. This course will address these differences and introduce many of the critical high-performance technologies that are relevant to embedded computing. Since VLIW processors are a main path to very high performance in the embedded world, and are offered by most embedded suppliers, they are emphasized as well.
Much of the material in this course will be based on Embedded Computing: A VLIW Approach to Architectures, Compilers and Tools, published in 2005 by Morgan-Kaufmann/Elsevier, see: http://www.vliw.org/book/
Memory Systems and Their Implementation Technologies, Trevor Mudge
Memories are the heart of computer systems. This series of lectures will provide an introduction to a broad range of memory systems. We will cover the common types of semiconductor and magnetic memories, and their underlying technologies. The lectures are designed to be self-contained. Topics will include:
- Volatile memories
- Basic organization of memory chips and their implementation
- SRAMs and DRAMs – basic memory circuits
- Building larger memories – packaging
- Bus interface alternatives – SDRAM/DDR/RDRAM
- Access time and power models – dynamic and static (leakage)
- Multiported memory organizations
- Techniques for limiting power: power gating/drowsy caches/subthreshold behavior
- Limits to scaling – implications for future organizations
- Performance evaluation of memory systems
- Reliability – fault tolerant memories
- Single event upsets
- Non-volatile memory
- Flash memory – basic memory circuits
- Access time and power models
- Disk memory – basic organization
- Access time and power models – DMA support
- Fault tolerant/RAID storage systems
High-Speed Interconnection Networks, Jose Duato
This course will present a novel approach to describing the architecture of high-speed interconnection networks, starting from a direct connection between two devices and moving to higher degrees of complexity as we try to interconnect more devices. At each level, new problems and alternative solutions for those problems will be introduced.
Course contents:
- Interconnections and interconnection networks
- The need for high-speed interconnects
- Requirements for high-speed interconnects
- Network architecture
- Direct communication between two devices
- Link pipelining
- Flow control
- Buffer management
- Communication through a single bus
- Communication through a single switch
- Switch architecture
- Switching techniques
- Routing and arbitration
- Congestion and head-of-line blocking
- Networks with multiple switches
- Direct and indirect topologies
- Deadlock handling
- Routing revisited
- Fault tolerance
- Congestion management
- Quality of service (QoS) support
- Reducing power consumption
- Services provided by the network interface
Chip multiprocessors, Per Stenström
The number of transistors available on a chip has now enabled the integration of multiprocessors on a chip – chip multiprocessors. In addition to the instruction-level parallelism exposed by each individual processor, chip multiprocessors also expose a coarser form of parallelism known as thread-level parallelism.
Starting from the architectural abstraction of this design style – the shared address space model – this course teaches the fundamentals and architectural techniques needed to implement that abstraction correctly and in a performance-, power-, and cost-effective way.
Important topics covered are:
- Memory consistency models
- Cache coherency protocols
- Memory hierarchy concepts
- Latency-tolerance techniques
- Thread-level speculation techniques
For a chip multiprocessor designer, the design space of the processor cores is of course rich, as is the spectrum of strategies for interconnecting the processors with the memory subsystem. As these topics are covered in other courses, they will not be addressed in this course.
Simulation, David August and Olivier Temam
Processor architecture simulation is the key method used by micro-architecture researchers for evaluating the performance of new architectural ideas. The most prevalent modelling methodology, hand-writing monolithic simulators in sequential programming languages, has several serious drawbacks that are exacerbated as systems become more complex. First, these simulators are difficult to construct, retarget, or integrate with one another, hindering the exploration of novel systems and increasing software collaboration barriers. Second, the sequential nature of the simulator code does not resemble the design or operation of actual systems, instilling little confidence in the accuracy of the results. Third, sequential programming language code does not benefit from being hosted on the coming generation of CMP systems, which limits simulation speed unless significant manual host retargeting is done, especially when modelling higher-order CMP systems. These issues have had severe consequences for micro-architecture research: the use of outdated architectural models, little reuse of development efforts across the community, unverifiable and unreliable results, and a reluctance of industry to take architecture ideas from academia.
In the past few years, the Liberty group at Princeton and the INRIA Alchemy group have both been working on an alternative approach to processor simulation called modular simulation (the Liberty project and the MicroLib project, respectively). The basic principle is to reflect the processor structure in the software structure, alleviating many of the above-mentioned issues. In this lecture series, Olivier Temam and David August will present the basic concepts behind structural simulation, describe their experience with it to date, and discuss its future. Participants will also work through hands-on exercises that demonstrate the ease and power of modelling with this methodology. This course requires that you bring your own laptop to the summer school.
Low power, Avi Mendelson
Low-power computers are essential for almost any modern computer system, and in particular for successful embedded systems, since most of them are battery operated and many must also fit within restricted thermal envelopes while meeting performance expectations.
In this course I will cover different aspects of designing low-power computers, starting at the circuit level and ending with the system perspective. Special emphasis will be given to recent research in this field and to open research topics.
Course schedule:
- Introduction
- Circuit techniques for low-power computers
- Architecture techniques for low-power computers
- Temperature control for low-power computers
- Software and OS perspective of low-power computers
Compilation for Embedded Processors, Rajiv Gupta
This course will cover code generation and code optimization techniques for embedded processors that aim to optimize for execution speed, memory, and power requirements. We will consider techniques aimed at optimizing execution speed that were developed for general purpose processors. The goal of this discussion will be to distinguish between standard compilation techniques that are suitable for embedded systems and those that are not.
Code size is an important concern in embedded systems. In systems designed to execute a few fixed applications, cost depends on the instruction memory size, while in more general systems, instruction memory traffic and performance are influenced by code size. We will examine in detail various approaches aimed at reducing code size, including software-only, hardware-only, and combined software-hardware approaches. The ISA-based approach taken by dual-width instruction set processors (e.g., ARM and THUMB) will also be considered.
Power optimization is also an important goal in embedded systems. We will study how traditional compilation tasks can be tuned to optimize for power by reducing switching activity. We will also consider code layout optimizations for reducing instruction memory traffic, which leads to reduced power consumption and improved performance. Finally, we will look at the compiler support required to exploit dynamic voltage scaling effectively and to turn on and off the various hardware features supported by embedded processors.
Adaptive & feedback driven compilation, Mike O'Boyle
Traditional approaches to optimising programs for high performance processors have been largely based on static analysis. Here, the compiler examines the program for pertinent characteristics which it compares to an internal hard-wired model. If there is a good fit between the program and the model then the compiler applies a set of transformations that aim to improve the performance of the program. However, this approach is becoming increasingly untenable. Processor architectures are becoming so complex that it is impossible to predict using simple models what the best optimisation should be. Furthermore, processors change so rapidly that it is increasingly difficult for compiler writers to keep up with architectural evolution. This course looks at a different approach where we have an adaptive rather than hardwired compiler optimisation strategy. Here the compiler utilises actual machine behaviour to guide its optimisation process rather than a simplified model.
This approach is applicable right across the board from embedded compilation to dynamic just-in-time compilation. The course will cover the following topics:
- Feedback directed and iterative compilation
- examining how low level runtime profile information can improve code generation and scheduling
- exploring the program optimisation space and how program behaviour changes chaotically with transformation selection
- Dynamic compilation and binary translation
- investigate trade-offs in runtime code generation and how just-in-time compilation allows data-specific program specialization
- examine different approaches to dynamic binary translation and how they impact on optimisation and portability
- Machine learning for program optimisation
- redefine program optimisation as a true optimisation problem: minimising an objective function over a domain, and then use mathematical optimisation approaches
- extend this to the field of machine learning using global optimisation and predictive modelling techniques ranging from simple interpolation to instance-based learning and Gaussian process prediction
The course will use case studies and refer to recent research papers throughout. New research directions and open problems will be highlighted in the final lecture.
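As a rough illustration of the iterative compilation idea described above (treating optimisation as minimising an objective function over a domain of transformation choices), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the flag names are hypothetical, and `synthetic_cost` is an invented stand-in for actually compiling and timing a program. It is not taken from the course material.

```python
import itertools
import random

# Hypothetical transformation names, chosen for illustration only.
FLAGS = ("unroll", "tile", "vectorize", "inline")


def synthetic_cost(config):
    """Stand-in for 'compile with these flags, run, and time the program'.

    The numbers are invented; they only model the fact that
    transformations interact rather than adding up independently.
    """
    cost = 10.0
    if config["unroll"]:
        cost -= 1.0
    if config["tile"]:
        cost -= 2.0
    if config["vectorize"]:
        # Vectorization pays off only when tiling is also applied.
        cost += -3.0 if config["tile"] else 1.0
    if config["inline"] and config["unroll"]:
        cost += 1.5  # combined code growth outweighs the benefit
    return cost


def exhaustive_search(cost_fn):
    """Evaluate every on/off combination and return the cheapest one."""
    best_config, best_cost = None, float("inf")
    for bits in itertools.product((False, True), repeat=len(FLAGS)):
        config = dict(zip(FLAGS, bits))
        cost = cost_fn(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def random_search(cost_fn, budget, seed=0):
    """Sample `budget` random configurations and keep the best seen."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(budget):
        config = {flag: rng.random() < 0.5 for flag in FLAGS}
        cost = cost_fn(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


if __name__ == "__main__":
    print(exhaustive_search(synthetic_cost))
    print(random_search(synthetic_cost, budget=8))
```

With only four on/off choices the space can be enumerated exhaustively. Real transformation spaces are far larger, which is why a budgeted search such as `random_search`, or a learned predictive model, becomes attractive.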
Optimizations in GCC, Ayal Zaks
In this course we'll present several key optimizations in GCC, demonstrating the different intermediate representations of GCC's middle and back ends. This will emphasize some of the new infrastructural frameworks provided recently in GCC's new 4.0 release, opening new opportunities for aggressive optimization development. In particular, we will describe the new Modulo Scheduling and Auto-Vectorization optimization passes introduced in GCC 4.0, building upon our original "Haifa Scheduler" and previous vectorization work. We will also describe new inter-procedural optimizations and analyses under development in GCC, some of which are planned for the next GCC 4.1 release. We will openly discuss our experience working on GCC, with its limitations and potential.
Tiled architectures, Doug Burger
Wire delays, power limitations, and complexity limitations are moving VLSI processor designs towards tiled architectures. Tiled architectures are distributed processor architectures aimed at exploiting fine-grain concurrency, often from a single programmer-specified thread. They typically incorporate a small number of tiles that are replicated across the chip and connected via an on-chip network, thus simplifying VLSI design complexity. In this course, I will survey the state of the art, describing the current tiled architectures being researched and designed, as well as the motivating technology challenges, design issues, and compilation and programming issues.
- Lecture I: Introduction, Overview, Technology Trends
- Wire delays
- Power issues
- Complexity issues
- Survey of tiled architectures
- On-chip network design issues
- Clocking issues
- Lecture II: The MIT RAW architecture
- Hardware design
- Execution model
- Statically scheduled network design
- Lecture III: The UT-Austin TRIPS architecture
- Execution model
- Instruction set design
- Detailed hardware presentation
- Lecture IV: Other Tiled Architectures
- Multiscalar
- Synchroscalar
- Wavescalar
- Lecture V: Compiling for Tiled Architectures
- Execution models: SPSI, SPDI, DPDI
- Code formation algorithms
- Hyperblocks
- Space-time scheduling
- Simulated Annealing for Instruction Scheduling
- Maintaining sequential memory semantics
Advanced microarchitecture, Yale Patt
Advanced Microarchitecture will not be about every recent nuance in the field, nor will it cover in detail every little thing on every little chip. My objective is to look at what is going on in microarchitecture today, what the current challenges are, and how we can best solve them for the microprocessor of 2015. The lecture style will be informal, and students will be encouraged to challenge the assumptions. The intent is to provide each student with a much deeper understanding of the fundamentals (from an advanced computer architecture standpoint) than the student comes in with.
Lectures:
- Introduction, Focus, and Road-map. Fundamentals on which to build. Emphasis on the tradeoffs. Architecture vs. Microarchitecture, run-time/compile-time. Levels of Transformation. Instruction bandwidth, data bandwidth, and instruction processing.
- Recent solutions: SMT, SSMT, Trace Cache, Block-structured ISA. A tutorial introduction to these fundamental concepts, and what each accomplishes.
- Branch Prediction. Perhaps today, the second most important problem (after data latency). The fundamental problem is bubbles in the pipeline due to redirection of the instruction stream. Branch prediction is one attempt at the solution. This lecture will attempt to set branch prediction in its proper comprehensive perspective, and includes details of other approaches to solving the same problem.
- Alternatives for Concurrency. There are many computing paradigms that allow parallel processing. We will discuss many of them, pointing out the pros and cons of each. SISD, SIMD, MIMD, vector processing, array processing, VLIW vs superscalar, DAE, HPS, and various forms of concurrent processing, from the aggressive SIMD Connection Machine to an aggressive MIMD with thousands of processors.
- The latest thing (Pentium M, Power 5, etc.) and what we can expect ten years from now. My view of what the microprocessor of 2015 will look like, together with some of the developments of the past year that point us in that direction.
Virtual machines, Jim Smith
Virtual machines – program execution environments implemented via a layer of concealed software – have emerged as a powerful means for tackling a number of important engineering problems, including software portability and security, resource sharing, and processor performance optimization.
This course will discuss both software and hardware techniques involved in implementing virtual machines. These include virtual machines implemented at both the system level and process level: both the classic system virtual machines and more recent high level language virtual machines such as the one that supports Java. The discussion will include several case studies chosen from commercial implementations as well as research projects.
The course will be based on the book: "Virtual Machines: Versatile Platforms for Systems and Processes" by J. E. Smith and Ravi Nair; to be published by Morgan-Kaufmann, 2005.




