B.6.4.3.3 Workpackage 3: High level transformation parametrized by a Streaming Abstract Machine
Workpackage number:
Start date or starting event: Month 1
Workpackage title: High level Partitionning, scheduling and optimization
Person months per partners: INRIA 60.
Objectives:
- Based on high-level information from the application and architecture description (as specified in WP2), partition and schedule the streaming application at the functional-block level (e.g., a collection of filters in a video processing application, or multiple stages in a 3D-graphics pipeline), optimizing the exploitation of task parallelism on the multi-core/multi-cluster streaming architecture (load balancing, communication/computation pipelining, local memory management).
- Design aggressive optimizations across functional block boundaries, combining interprocedural analysis and optimization, clustering of multiple functional blocks into larger compilation units, and loop optimizations targetting the memory hierarchy and the extraction of fine-grain parallelism (for WP4).
Description of Work
Future and emerging embedded architectures increasingly rely on compiler technology to extract parallelism and improve memory locality. Yet todays compiler optimizations can hardly cope with the multiple levels of parallelism, heterogeneous memory hierarchies and run-time behavior of modern architectures. Indeed, driving program optimizations and code generation for multi-core processors with vast computing resources (vectors, very long instrucion words, hardware threads) is a major challenge. This workpackage thus faces a hard problem of high-end compilation. Despite multiple efforts throughout the last 20 years, attempts at automatically generating efficient parallel code from sequential imperative programs have seen limited success in production environments. Several participants of the projects have been directly involved in some of these past attempts, and our experience is key to understanding the reasons for these half successes.
First of all, most parallelizing compilers rely on static analyses to recover (reverse engineer) high-level data-flow information lost in the implementation problem. These analyses are very fragile: small syntactic variations in the input program may ruin the whole optimization chain, with little or no feedback to the application programmer. We avoid this pitfall in capturing domain-specific knowledge of streaming applications through the rich high-level semantics of the programming model defined in WP2. In particular, the hiearchical block-structured decomposition of each functional block is supposed to come from the application programmer, avoiding chaotic attempts at automatically splitting a function into independent blocks in the compiler. In addition, each block is provided with precise execution time and profiling information, allowing for precise load-balancing and placement.
The second pitfall is the unpredictability of current programmable architectures (embedded and general-purpose). On modern microprocessors, multiple architecture phenomena occur simultaneously and interact together, which requires multiple combined program transformations [1]. Whether these transformations are selected through static analysis and models, runtime feedback, or both, the underlying infrastructure must have the ability to perform long and complex compositions of program transformations in a flexible manner. Existing compilers are ill-equipped to perform that task because of rigid phase ordering, fragile selection rules using pattern matching, and cumbersome expression of loop transformations on syntax trees. Our approach to avoid this pifall is twofold: (1) to target simpler, predictable architectures which let the compiler take most optimization decisions ahead of the execution, and which expose the “spatial� characteristics of the architecture to the compiler (e.g., through explicit data movements, local memory management and placement of tasks); (2) to develop and implement advanced dependence analysis, loop transformation and parallelization techniques suitable for the automatic selection and tuning of long transformation sequences on embedded streaming architectures.
-> FOOTNOTE? Notice the work conducted in this workpackage contributes to the improvement of GCC beyond the the domain of streaming applications. Loop optimization across function boundaries is an active research topic for general-purpose and embedded compilation, as well as innovative polyhedral representations of loop nests and their transformations.
Task 1: Partition and schedule the application on the abstract streaming machine.
The work will first focus on providing of formal definition of the optimization problem, then study optimal and/or heuristic algorithms to exploit task parallelism in the streaming application. Technically, we are dealing with a constrained resource allocation and scheduling problem, considering domain-specific information about the application (annotations from WP2):
- the functional-block level data-flow in the application;
- the hierarchical block-structured semantics of each functional block (and the associated parameters);
- memory requirements for each block (independent of the high-level schedule and communication scheme);
- the execution time profile of each functional block;
- any real-time constraints (production rates and latencies);
and considering architecture-specific information (abstract streaming machine description from WP2):
- the number and nature (general purpose or HW accelerators) of computing clusters in the ASM;
- the communication network between these clusters (bandwidth, latencies, communication control);
- the amount of local memory within each computing cluster.
From this theoretical study, we will design a practical algorithm and implement a first prototype in a development branch of GCC. This prototype will evolve into a portable optimization pass in the GCC compiler, operating on the intermediate representation designed in WP2. This pass must combine interprocedural data-flow analyses based on user annotations, interprocedural transformations like inlining and cloning, and intra-procedural transformations such as scalar promotion and explicit data movements (DMA control, local memory management). Yet no existing optimization pass in GCC requires breaking procedural boundaries, and the combination of inter and intraprocedural optimization is subject to active research [TelescopingLanguages, SoftwareThreadIntegration, ProcedureBoundaryElimination_by_DaveAugust_YetUnpublished?]. The prototype optimization pass must also merge high-level data-flow information collected from user annotations with scalar and array data-flow information available at the GIMPLE Tree-SSA level. Although this will translate into significant development and implementation efforts, it will be compatible with the size and scope of this project, since the development may be easily layered into incremental improvements (e.g., operating first at the procedure call graph level, then integrating Tree-SSA operations), and given that the baseline GCC infrastructure will not have to be modified.
Deliverables.
D1.1 a report about the most significant candidate optimizations at the function level
D1.2 a cost model for these optimizations based on the ASM, document and prototype implementation
D1.3 a prototype of function level optimizations and automatic generation of placement and communication code
D1.4 a robust version of this implementation, synchronized after the milestones on the ASM and program model design have been reached
Task 2: LNO.
Other generic compiler capabilities that are needed in order to improve the capability of the vectorizer to detect vectorizable loops, include: Support for user hints through the usage of pragmas; Improved data-dependence analysis (by employing more powerful dependence tests); Support for runtime dependence and aliasing tests; Develop loop transformations that can simplify vectorization (such as loop transpose, loop distribution, and more); Develop data layout transformations to improve data accesses and data alignment; Develop inter-procedural analyses such as alignment information to increase the potential for vectorization and increasing the quality of vectorized code.
In addition, we propose to develop an alternative development branch whose aim is to bring the expressive power and precision of polyhderal program representations to a production environment. This branch will be a port and extension of a preliminary implementation in the (free) Open64/PathScaleEKO compiler, aiming at native IA64 (itanium), AMD64 (athlon) and IA32 (pentium) code generation, along with source-to-source optimization of Fortran90, C and C++ [2]. Technically, our approach relies on a unified polyhedral representation of loops and statements. The key is to clearly separate four types of actions associated with program transformations: iteration domain, schedule, data layout and memory access functions modifications, revisiting most analyses and transformations with respect to the schedule component of the program representation. We wish to extend this framework and implementation with new optimizations supporting embedded streaming processors, including loop transformations to enable the automatic vectorization of irregular, multiply-nested loops, as well as automatic partitioning and parallelization (with generation of OpenMP constructs). Although our framework supports source-to-source compilation and is thus theoretically independent on the back-end compiler, we wish to take the opportunity of this project to port it to GCC, to improve its robustness and its acceptation by a wider community of developers and embedded system designers. This is a relatively lightweight process given the portable plugin nature of our implementation.
Deliverables.
D2.1 a document specifying which loop nest optimizations are of interest in the relevant applications.
D2.2 an implementation inside GCC4 to support the loop nest optimizations identified in D5.1.
D2.3 a parameterization of this LNO by the ASM, triggered in the GCC compilation flow when dealing with streaming applications.
D2.4 a prototype port of the polyhedral LNO being developped at INRIA to the GCC environment.
D2.5 a summary of our findings on the respective power of syntax-based versus polyhedral LNO implementations.
Milestones.
M2.1 decide whether to support research and development of a polyhedral LNO in GCC, with an emphasis on streaming applications and stream-processing specific program transformations.
[1] D. Parello, O. Temam, A. Cohen, and J.-M. Verdun. Towards a systematic, pragmatic and architecture-aware program optimization process for complex processors. In ACM Supercomputing'04, Pittsburgh, Pennsylvania, November 2004.
[2] Albert Cohen, Sylvain Girbal, David Parello, Marc Sigler, Olivier Temam, and Nicolas Vasilache. Facilitating the search for compositions of program transformations. In ACM Int. Conf. on Supercomputing (ICS'05), Boston, Massachusetts, June 2005.
ADD REFERENCES on LNO design, global optimization at the function level, and GCC-related developments.
