Task Force on Reliability and Availability
Motivation
Technological developments are confronting us with the challenge of building computer systems out of unreliable parts. In tomorrow's computing world, failures will not be exceptional; they will arise from many frequent causes, such as soft errors, process variation, wear-out, hardware and software bugs, incomplete specifications, and the impossibility of simulating all use cases. An additional challenge stems from the wide diversity of computing systems, from high-end supercomputers to low-end embedded devices, each with different reliability and availability (R&A) requirements. The worst-case forecast is that R&A constraints may negate the benefits of technology scaling and slow economic growth in several ICT-driven markets. Consequently, future computing systems need to be dependable. More specifically, they need to operate correctly and satisfactorily in the presence of faults, be capable of detecting, repairing, and recovering from faults online, have little downtime, and do so at a favorable dependability/euro ratio.
The purpose of this task force is to promote awareness and research activity within HiPEAC related to R&A. Some of the instruments to achieve this goal are:
- Organize R&A-related activities at summer schools, conferences, and cluster meetings
- Mobilize researchers in R&A-related areas
- Augment the HiPEAC roadmap's vision of R&A issues
NEWS: 2nd Workshop on Design for Reliability, held during the HiPEAC 2010 Conference, Sunday, January 24, 2010
Task Force Meetings
Conferences and Other Related Sites
Past Activities with HiPEAC Affiliation
Contact: Yiannakis Sazeides (UCY)
CPPC
CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool that adds fault tolerance to long-running message-passing applications. It is designed to allow execution to restart on different architectures and/or operating systems, and it also supports checkpointing over heterogeneous systems, such as the Grid. It uses portable code and protocols and generates portable checkpoint files, while avoiding traditional solutions whose overhead does not scale (such as runtime coordination or message logging).
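
To make the idea of a portable checkpoint concrete, the following is a minimal sketch of application-level checkpointing in Python. It is not CPPC's interface or implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a fault are all assumptions made for illustration. It also covers only a single process, so the message-passing aspects that CPPC addresses are left out. The point it illustrates is that state saved in an architecture-neutral format can be reloaded later, possibly on a different machine, so that the computation resumes instead of starting over.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON (a text format independent of architecture),
    then atomically rename it into place so a crash never leaves a
    half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that raises an error at a given step
    to simulate a fault; a later call resumes from the last checkpoint.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that fails partway through loses only the work done since the last checkpoint: calling `run` again with the same path picks up from the saved step counter and accumulator.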
