WHAT IS RELIABILITY AND AVAILABILITY?
for servers: data consistency and high availability.
need system-wide reliability at all levels, both hardware and software.
network reliability both off-chip and on-chip. not clear whether on-chip problems will differ from off-chip ones.
safety: a more demanding criterion than reliability. it involves a high-overhead process (in terms of resources and time) called certification (e.g. certifying the worst-case execution time, WCET, of a program executing on a processor).
no single answer to all problems.
different types of reliability concerns for hw and sw; different market segments have different concerns or weight them differently.
different reliability metrics for different levels of design and market segments.
criteria met in isolation will still need to provide, overall, a reliable design.
#################################################################
IMPLICATIONS
if reliability is not addressed adequately, we will see cost increases in computing devices and slower market growth.
#################################################################
SOLUTIONS/LESSONS:
overprovisioning: acceptable approach for area, energy. how much? no single answer for all; depends on market segment. unclear how overprovisioning scales with technology (stays the same or increases?).
Solutions addressed in the presentation:
1. for software: code reviewing process
2. Razor technology to catch delay bugs
3. redundant cores
4. software-aware placement to avoid mapping onto defective parts (OS, compiler, and run-time environment)
5. can reliability affect/complicate WCET analysis?
6. reliability and performance can be contradictory goals (e.g. replication for availability increases coherence traffic and slows performance)
7. need to design for R&A from the start, not treat it as an add-on
8. R&A design complexity may be higher than that of the fault-free path
9. exploit zero patterns of values in datapaths to reduce what needs to be checked for short errors
10. reliability can affect both architectural and non-architectural resources. the former must be protected for correctness; the latter may need protection to ensure good performance.
11. fight process variation using adaptive body bias
#################################################################
OTHER:
Need a taxonomy of reliability issues for the different aspects and levels of design.
Create a site for the task force where info can be disseminated and people's expertise can be found.
Try to identify skills and interests according to the above, to know who is working on what.
#################################################################
NEXT STEPS:
register on the R&A email list
collect expertise (names and group web pages) and list it on the task force web site
participate in the task force's preparation of the road map
meeting at summer school
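
Illustration for the "how much?" question under overprovisioning and the redundant-cores item in SOLUTIONS/LESSONS: a rough sketch only, not something taken from the presentation. It assumes cores fail independently and that the system is up as long as at least one core works. The function names and the numbers are hypothetical.

```python
def system_availability(core_availability: float, n_cores: int) -> float:
    """Availability of n redundant cores when one working core is enough.

    Assumption: core failures are independent, so the system is down
    only when all n cores are down at the same time.
    """
    if not 0.0 <= core_availability <= 1.0:
        raise ValueError("core_availability must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 1.0 - (1.0 - core_availability) ** n_cores


def cores_needed(core_availability: float, target: float, max_cores: int = 64) -> int:
    """Smallest number of redundant cores that reaches the target availability."""
    for n in range(1, max_cores + 1):
        if system_availability(core_availability, n) >= target:
            return n
    raise ValueError("target not reachable within max_cores")


if __name__ == "__main__":
    # Hypothetical numbers: each core is up 90% of the time.
    for n in (1, 2, 3):
        print(n, "core(s):", system_availability(0.9, n))
    print("cores for a 0.9985 target:", cores_needed(0.9, 0.9985))
```

Under these assumptions each extra core removes a fixed fraction of the remaining downtime, so the gain per core shrinks quickly while area and energy cost grow linearly. Correlated failures (e.g. shared power or process variation across a die) would make the real figures worse than this model suggests.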