A Hardware/Software Approach To order this title, and for more information, click here
By David Culler, University of California, Berkeley J.P. Singh, Princeton University Anoop Gupta, Carnegie Mellon University
Description
The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common
machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven
computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern
design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics.
It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques
interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of
parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Contents
1 Introduction
1.1 Introduction
1.2 Why Parallel Architecture
1.2.1 Application Trends
1.2.2 Technology Trends
1.2.3 Architectural Trends
1.2.4 Supercomputers
1.2.5 Summary
1.3 Convergence of Parallel Architectures
1.3.1 Communication Architecture
1.3.2 Shared Memory
1.3.3 Message-Passing
1.3.4 Convergence
1.3.5 Data Parallel Processing
1.3.6 Other Parallel Architectures
1.3.7 A
Generic Parallel Architecture
1.4 Fundamental Design Issues
1.4.1 Communication Abstraction
1.4.2
Programming Model Requirements
1.4.3 Naming
1.4.4 Ordering
1.4.5 Communication and Replication
1.4.6
Performance
1.5 Concluding Remarks
1.6 References
1.7 Exercises
2 Parallel
Programs
2.1 Introduction
2.2 Parallel Application Case Studies
2.2.1 Simulating Ocean Currents
2.2.2 Simulating the Evolution of Galaxies
2.2.3 Visualizing Complex Scenes using Ray Tracing
2.2.4 Mining Data
for Associations
2.3 The Parallelization Process
2.3.1 Steps in the Process
2.3.2 Parallelizing Computation
versus Data
2.3.3 Goals of the Parallelization Process
2.4 Parallelization of an Example Program
2.4.1
A Simple Example: The Equation Solver Kernel
2.4.2 Decomposition
2.4.3 Assignment
2.4.4 Orchestration under
the Data Parallel Model
2.4.5 Orchestration under the Shared Address Space Model
2.4.6 Orchestration under the Message
Passing Model
2.5 Concluding Remarks
2.6 References
2.7 Exercises
3 Programming
for Performance
3.1 Introduction
3.2 Partitioning for Performance
3.2.1 Load Balance and Synchronization
Wait Time
3.2.2 Reducing Inherent Communication
3.2.3 Reducing the Extra Work
3.2.4 Summary
3.3 Data Access and Communication in a Multi-Memory System
3.3.1 A Multiprocessor as an Extended Memory Hierarchy
3.3.2
Artifactual Communication in the Extended Memory Hierarchy
3.4 Orchestration for Performance
3.4.1 Reducing
Artifactual Communication
3.4.2 Structuring Communication to Reduce Cost
3.5 Performance Factors from the
Processors? Perspective
3.6 The Parallel Application Case Studies: An In-Depth Look
3.6.1 Ocean
3.6.2 Barnes-Hut
3.6.3 Raytrace
3.6.4 Data Mining
3.7 Implications for Programming Models
3.8 Concluding Remarks
3.9 References
3.10 Exercises
4 Workload-Driven Evaluation
4.1 Introduction
4.2 Scaling Workloads and Machines
4.2.1 Why Worry about Scaling?
4.2.2 Key Issues in Scaling
4.2.3 Scaling
Models
4.2.4 Scaling Workload Parameters
4.3 Evaluating a Real Machine
4.3.1 Performance Isolation
using Microbenchmarks
4.3.2 Choosing Workloads
4.3.3 Evaluating a Fixed-Size Machine
4.3.4 Varying Machine
Size
4.3.5 Choosing Performance Metrics
4.3.6 Presenting Results
4.4 Evaluating an Architectural
Idea or Tradeoff
4.4.1 Multiprocessor Simulation
4.4.2 Scaling Down Problem and Machine Parameters for Simulation
4.4.3 Dealing with the Parameter Space: An Example Evaluation
4.4.4 Summary
4.5 Illustrating Workload
Characterization
4.5.1 Workload Case Studies
4.5.2 Workload Characteristics
4.6 Concluding Remarks
4.7 References
4.8 Exercises
5 Shared Memory Multiprocessors
5.1 Introduction
5.2 Cache Coherence
5.2.1 The Cache Coherence Problem
5.2.2 Cache Coherence Through Bus Snooping
5.3 Memory Consistency
5.3.1 Sequential Consistency
5.3.2 Sufficient Conditions for Preserving Sequential Consistency
5.4 Design Space for Snooping Protocols
5.4.1 A 3-state (MSI) Write-back Invalidation Protocol
5.4.2 A 4-state (MESI)
Write-Back Invalidation Protocol
5.4.3 A 4-state (Dragon) Write-back Update Protocol
5.5 Assessing Protocol
Design Tradeoffs
5.5.1 Workloads
5.5.2 Impact of Protocol Optimizations
5.5.3 Tradeoffs in Cache Block Size
5.5.4 Update-based vs. Invalidation-based Protocols
5.6 Synchronization
5.6.1 Components of a Synchronization
Event
5.6.2 Role of User, System Software and Hardware
5.6.3 Mutual Exclusion
5.6.4 Point-to-point Event Synchronization
5.6.5 Global (Barrier) Event Synchronization
5.6.6 Synchronization Summary
5.7 Implications for Software
5.8 Concluding Remarks
5.9 References
5.10 Exercises
6 Snoop-based Multiprocessor Design
6.1 Introduction
6.2 Correctness Requirements
6.3 Base Design: Single-level Caches with an Atomic Bus
6.3.1
Cache controller and tags
6.3.2 Reporting snoop results
6.3.3 Dealing with Writebacks
6.3.4 Base Organization
6.3.5 Non-atomic State Transitions
6.3.6 Serialization
6.3.7 Deadlock
6.3.8 Livelock and Starvation
6.3.9 Implementing Atomic Primitives
6.4 Multi-level Cache Hierarchies
6.4.1 Maintaining inclusion
6.4.2 Propagating transactions for coherence in the hierarchy
6.5 Split-transaction Bus
6.5.1 An Example
Split-transaction Design
6.5.2 Bus Design and Request-response Matching
6.5.3 Snoop Results and Conflicting Requests
6.5.4 Flow Control
6.5.5 Path of a Cache Miss
6.5.6 Serialization and Sequential Consistency
6.5.7 Alternative
design choices
6.5.8 Putting it together with multi-level caches
6.5.9 Supporting Multiple Outstanding Misses from
a Processor
6.6 Case Studies: SGI Challenge and Sun Enterprise SMPs
6.6.1 SGI Powerpath-2 System Bus
6.6.2 SGI Processor and Memory Subsystems
6.6.3 SGI I/O Subsystem
6.6.4 SGI Challenge Memory System Performance
6.6.5 Sun Gigaplane System Bus
6.6.6 Sun Processor and Memory Subsystem
6.6.7 Sun I/O Subsystem
6.6.8
Sun Enterprise Memory System Performance
6.6.9 Application Performance
6.7 Extending Cache Coherence
6.7.1
Shared-Cache Designs
6.7.2 Coherence for Virtually Indexed Caches
6.7.3 Translation Lookaside Buffer Coherence
6.7.4 Cache coherence on Rings
6.7.5 Scaling Data Bandwidth and Snoop Bandwidth
6.8 Concluding Remarks
6.9 References
6.10 Exercises
7 Scalable Multiprocessors
7.1 Introduction
7.2 Scalability
7.2.1 Bandwidth Scaling
7.2.2 Latency Scaling
7.2.3 Cost Scaling
7.2.4 Physical
scaling
7.2.5 Scaling in a Generic Parallel Architecture
7.3 Realizing Programming Models
7.3.1 Primitive
Network Transactions
7.3.2 Shared Address Space
7.3.3 Message Passing
7.3.4 Common challenges
7.3.5
Communication architecture design space
7.4 Physical DMA
7.4.1 A Case Study: nCUBE/2
7.5
User-level Access
7.5.1 Case Study: Thinking Machines CM-5
7.5.2 User Level Handlers
7.6 Dedicated
Message Processing
7.6.1 Case Study: Intel Paragon
7.6.2 Case Study: Meiko CS-2
7.7 Shared Physical
Address Space
7.7.1 Case study: Cray T3D
7.7.2 Cray T3E
7.7.3 Summary
7.8 Clusters and Networks
of Workstations
7.8.1 Case Study: Myrinet SBus Lanai
7.8.2 Case Study: PCI Memory Channel
7.9 Comparison
of Communication Performance
7.9.1 Network Transaction Performance
7.9.2 Shared Address Space Operations
7.9.3
Message Passing Operations
7.9.4 Application Level Performance
7.10 Synchronization
7.10.1 Algorithms
for Locks
7.10.2 Algorithms for Barriers
7.11 Concluding Remarks
7.12 References
7.13 Exercices
8 Directory-based Cache Coherence
8.1 Introduction
8.2 Scalable Cache Coherence
8.3 Overview
of Directory-Based Approaches
8.3.1 Operation of a Simple Directory Scheme
8.3.2 Scaling
8.3.3 Alternatives
for Organizing Directories
8.4 Assessing Directory Protocols and Tradeoffs
8.4.1 Data Sharing Patterns for
Directory Schemes
8.4.2 Local versus Remote Traffic
8.4.3 Cache Block Size Effects
8.5 Design Challenges
for Directory Protocols
8.5.1 Performance
8.5.2 Correctness
8.5.3 Overview of Origin2000 Hardware
8.6 Cache-based Directory Protocols: The Sequent NUMA-Q
8.6.1 Cache Coherence Protocol
8.6.2 Dealing with Correctness
Issues
8.6.3 Protocol Extensions
8.6.4 Overview of NUMA-Q Hardware
8.6.5 Protocol Interactions with SMP Node
8.6.6 IQ-Link Implementation
8.6.7 Performance Characteristics
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor
8.7 Performance Parameters and Protocol Performance
8.8 Synchronization
8.8.1 Performance of Synchronization Algorithms
8.8.2 Supporting Atomic Primitives
8.9 Implications for Parallel Software
8.10 Advanced Topics
8.10.1
Reducing Directory Storage Overhead
8.10.2 Hierarchical Coherence
8.11 Concluding Remarks
8.12 References
8.13 Exercises
9 Hardware-Software Tradeoffs
9.1 Introduction
9.2 Relaxed Memory
Consistency Models
9.2.1 System Specifications
9.2.2 The Programming Interface
9.2.3 Translation Mechanisms
9.2.4 Consistency Models in Real Systems
9.3 Overcoming Capacity Limitations
9.3.1 Tertiary Caches
9.3.2 Cache-only Memory Architectures (COMA)
9.4 Reducing Hardware Cost
9.4.1 Hardware Access Control
with a Decoupled Assist
Books and book related electronic products are priced in US dollars (USD), euro (EUR), and Great Britain Pounds (GBP). USD prices apply to the Americas and Asia Pacific. EUR prices apply in Europe and the Middle East. GBP prices apply to the UK and all other countries.