Tutorial on Architecture Design for Soft Errors

25 January 2009
by Joel Emer, Shubu Mukherjee, Intel Corporation

Held in conjunction with: the 4th International Conference on High-Performance Embedded Architectures and Compilers (HiPEAC)
Sunday, January 25, Paphos, Cyprus

As kids, many of us were fascinated by black holes and solar flares in deep space. Little did we know that particles from deep space could affect computing systems at the surface of the earth, causing blue screens and incorrect bank balances. CMOS technology has shrunk to a point where radiation from deep space and from packaging material has started causing such malfunctions at an increasing rate. These radiation-induced errors are termed "soft" because the state of one or more bits in a silicon chip can flip temporarily without damaging the hardware. The lack of any appropriate shielding material has led the design community to look for process, circuit, architectural, and software solutions to mitigate the effect of soft errors.


This tutorial will cover architectural techniques to tackle the soft error problem and is based on Shubu Mukherjee's recent book, "Architecture Design for Soft Errors." Computer architecture has long coped with various types of faults, including ones induced by radiation. For example, error correction codes are commonly used in memory systems. High-end systems have often used redundant copies of hardware to detect faults and recover from errors. Many of these solutions have, however, been prohibitively expensive and difficult to justify in the mainstream commodity computing market.


The need to find cheaper reliability solutions has driven a whole new class of quantitative analysis of soft errors and corresponding solutions that mitigate their effects. This tutorial will cover the new methodologies for quantitative analysis of soft errors as well as novel cost-effective architectural techniques to mitigate them. This tutorial will also re-evaluate traditional architectural solutions in the context of the new quantitative analysis.


More specifically, this tutorial will cover:
  • Introduction to soft errors
  • Basic understanding of the physics of soft errors
  • Prevalent circuit techniques to protect against soft errors
  • How to analyze soft errors quantitatively at the architecture level
  • How to reduce soft errors using error correction codes (without going into number theory)
  • How to detect soft errors using redundant computation
  • How to recover from a soft error once we detect an error
Much of the material in this tutorial will be based on the book, "Architecture Design for Soft Errors." Elsevier, Inc. holds the copyright (c) to some of the material on this website.