Building Resilient Systems brings together key concepts and methods to meet crucial challenges of energy-efficient system resilience. Computer system design is undergoing a paradigm shift in the wake of several disruptive trends: (a) CMOS device technology scaling is becoming progressively more difficult, and yield and cost concerns are very high as fabrication capabilities are consolidated among just a few vendors; (b) the power and reliability walls pose major obstacles to the goal of improving hardware performance at historical rates; (c) computing paradigms in the era of the Internet of Things (IoT) are evolving towards a more networked, distributed model.
This book presents a modern perspective on how to build resilient computer systems, emphasizing reliability without incurring unaffordable overhead in processor chip area, net system power or performance. Design constraints of the late CMOS era impose hard limits on chip-level power density, current density and thermal profiles. Various kinds of transient and permanent failure mechanisms, as well as low-yield concerns, are on the rise, and the power wall makes massively redundant architectures impractical. The author advocates the use of cross-layer, hardware-software co-design techniques to minimize power and cost overhead while maximizing performance and system resilience. The book provides new-generation modeling methods, cross-layer optimization and trade-off analysis techniques, as well as the cross-layer error-tolerant system architectures that will be needed in the future.
- Covers cross-layer resilience modeling, a new technique to optimize energy consumption across system layers
- Explains the necessary tradeoffs to provide targeted system resilience without blowing the cost or power budget
- Presents application-driven compute engines that offer cost-effective solutions for big data analytics, cloud or mobile computing, or cognitive systems without compromising end-user quality and system availability
- Includes case studies illustrating examples of embedded, server, and supercomputing systems
Computer engineers and computer science researchers studying resilience or fault tolerance across multiple domains
1. Technology trends: power wall vs. reliability wall
2. Circuit- and gate-level fault models relevant to modern design
3. Fundamentals of application-level resilience modeling and analysis
4. Cross-layer resilience modeling and failure mitigation
5. Case studies
- © Morgan Kaufmann 2019
- 1st October 2018
- Morgan Kaufmann
Pradip Bose is a Research Staff Member and Manager of the Reliability- and Power-Aware Microarchitectures Department at IBM T. J. Watson Research Center. His research interests are in the area of processor and system architectures, with a focus on technology-aware design. Pradip is also an Adjunct Professor in the Department of Computer Science at Columbia University. From 1983 to 1987, Pradip was a member of IBM’s pioneering RISC superscalar processor R&D team. During the 1989-90 academic year, Pradip was on sabbatical leave from IBM, serving as Visiting Associate Professor at the Indian Statistical Institute (ISI) in Calcutta, India. At ISI, Pradip served as the coordinating leader of a UNDP-sponsored project on knowledge-based computer systems.
Reliability and Power-Aware Microarchitecture Department, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA