Save up to 30% on Elsevier print and eBooks with free shipping. No promo code needed.
Save up to 30% on print and eBooks.
Parallel Computer Architecture
A Hardware/Software Approach
1st Edition - August 1, 1998
Authors: David Culler, Jaswinder Pal Singh, Anoop Gupta
Language: English
Hardback ISBN:9781558603431
9 7 8 - 1 - 5 5 8 6 0 - 3 4 3 - 1
eBook ISBN:9780080573076
9 7 8 - 0 - 0 8 - 0 5 7 3 0 7 - 6
The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the fo…Read more
Purchase options
LIMITED OFFER
Save 50% on book bundles
Immediately download your ebook while waiting for your print delivery. No promo code is needed.
The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies
illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs
1 Introduction
1.1 Introduction
1.2 Why Parallel Architecture
1.2.1 Application Trends
1.2.2 Technology Trends
1.2.3 Architectural Trends
1.2.4 Supercomputers
1.2.5 Summary
1.3 Convergence of Parallel Architectures
1.3.1 Communication Architecture
1.3.2 Shared Memory
1.3.3 Message-Passing
1.3.4 Convergence
1.3.5 Data Parallel Processing
1.3.6 Other Parallel Architectures
1.3.7 A Generic Parallel Architecture
1.4 Fundamental Design Issues
1.4.1 Communication Abstraction
1.4.2 Programming Model Requirements
1.4.3 Naming
1.4.4 Ordering
1.4.5 Communication and Replication
1.4.6 Performance
1.5 Concluding Remarks
1.6 References
1.7 Exercises
2 Parallel Programs
2.1 Introduction
2.2 Parallel Application Case Studies
2.2.1 Simulating Ocean Currents
2.2.2 Simulating the Evolution of Galaxies
2.2.3 Visualizing Complex Scenes using Ray Tracing
2.2.4 Mining Data for Associations
2.3 The Parallelization Process
2.3.1 Steps in the Process
2.3.2 Parallelizing Computation versus Data
2.3.3 Goals of the Parallelization Process
2.4 Parallelization of an Example Program
2.4.1 A Simple Example: The Equation Solver Kernel
2.4.2 Decomposition
2.4.3 Assignment
2.4.4 Orchestration under the Data Parallel Model
2.4.5 Orchestration under the Shared Address Space Model
2.4.6 Orchestration under the Message Passing Model
2.5 Concluding Remarks
2.6 References
2.7 Exercises
3 Programming for Performance
3.1 Introduction
3.2 Partitioning for Performance
3.2.1 Load Balance and Synchronization Wait Time
3.2.2 Reducing Inherent Communication
3.2.3 Reducing the Extra Work
3.2.4 Summary
3.3 Data Access and Communication in a Multi-Memory System
3.3.1 A Multiprocessor as an Extended Memory Hierarchy
3.3.2 Artifactual Communication in the Extended Memory Hierarchy
3.4 Orchestration for Performance
3.4.1 Reducing Artifactual Communication
3.4.2 Structuring Communication to Reduce Cost
3.5 Performance Factors from the ProcessorsÕ Perspective
3.6 The Parallel Application Case Studies: An In-Depth Look
3.6.1 Ocean
3.6.2 Barnes-Hut
3.6.3 Raytrace
3.6.4 Data Mining
3.7 Implications for Programming Models
3.8 Concluding Remarks
3.9 References
3.10 Exercises
4 Workload-Driven Evaluation
4.1 Introduction
4.2 Scaling Workloads and Machines
4.2.1 Why Worry about Scaling?
4.2.2 Key Issues in Scaling
4.2.3 Scaling Models
4.2.4 Scaling Workload Parameters
4.3 Evaluating a Real Machine
4.3.1 Performance Isolation using Microbenchmarks
4.3.2 Choosing Workloads
4.3.3 Evaluating a Fixed-Size Machine
4.3.4 Varying Machine Size
4.3.5 Choosing Performance Metrics
4.3.6 Presenting Results
4.4 Evaluating an Architectural Idea or Tradeoff
4.4.1 Multiprocessor Simulation
4.4.2 Scaling Down Problem and Machine Parameters for Simulation
4.4.3 Dealing with the Parameter Space: An Example Evaluation
4.4.4 Summary
4.5 Illustrating Workload Characterization
4.5.1 Workload Case Studies
4.5.2 Workload Characteristics
4.6 Concluding Remarks
4.7 References
4.8 Exercises
5 Shared Memory Multiprocessors
5.1 Introduction
5.2 Cache Coherence
5.2.1 The Cache Coherence Problem
5.2.2 Cache Coherence Through Bus Snooping
5.3 Memory Consistency
5.3.1 Sequential Consistency
5.3.2 Sufficient Conditions for Preserving Sequential Consistency
5.4 Design Space for Snooping Protocols
5.4.1 A 3-state (MSI) Write-back Invalidation Protocol
5.4.2 A 4-state (MESI) Write-Back Invalidation Protocol
5.4.3 A 4-state (Dragon) Write-back Update Protocol
5.5 Assessing Protocol Design Tradeoffs
5.5.1 Workloads
5.5.2 Impact of Protocol Optimizations
5.5.3 Tradeoffs in Cache Block Size
5.5.4 Update-based vs. Invalidation-based Protocols
5.6 Synchronization
5.6.1 Components of a Synchronization Event
5.6.2 Role of User, System Software and Hardware
5.6.3 Mutual Exclusion
5.6.4 Point-to-point Event Synchronization
5.6.5 Global (Barrier) Event Synchronization
5.6.6 Synchronization Summary
5.7 Implications for Software
5.8 Concluding Remarks
5.9 References
5.10 Exercises
6 Snoop-based Multiprocessor Design
6.1 Introduction
6.2 Correctness Requirements
6.3 Base Design: Single-level Caches with an Atomic Bus
6.3.1 Cache controller and tags
6.3.2 Reporting snoop results
6.3.3 Dealing with Writebacks
6.3.4 Base Organization
6.3.5 Non-atomic State Transitions
6.3.6 Serialization
6.3.7 Deadlock
6.3.8 Livelock and Starvation
6.3.9 Implementing Atomic Primitives
6.4 Multi-level Cache Hierarchies
6.4.1 Maintaining inclusion
6.4.2 Propagating transactions for coherence in the hierarchy
6.5 Split-transaction Bus
6.5.1 An Example Split-transaction Design
6.5.2 Bus Design and Request-response Matching
6.5.3 Snoop Results and Conflicting Requests
6.5.4 Flow Control
6.5.5 Path of a Cache Miss
6.5.6 Serialization and Sequential Consistency
6.5.7 Alternative design choices
6.5.8 Putting it together with multi-level caches
6.5.9 Supporting Multiple Outstanding Misses from a Processor
6.6 Case Studies: SGI Challenge and Sun Enterprise SMPs
6.6.1 SGI Powerpath-2 System Bus
6.6.2 SGI Processor and Memory Subsystems
6.6.3 SGI I/O Subsystem
6.6.4 SGI Challenge Memory System Performance
6.6.5 Sun Gigaplane System Bus
6.6.6 Sun Processor and Memory Subsystem
6.6.7 Sun I/O Subsystem
6.6.8 Sun Enterprise Memory System Performance
6.6.9 Application Performance
6.7 Extending Cache Coherence
6.7.1 Shared-Cache Designs
6.7.2 Coherence for Virtually Indexed Caches
6.7.3 Translation Lookaside Buffer Coherence
6.7.4 Cache coherence on Rings
6.7.5 Scaling Data Bandwidth and Snoop Bandwidth
6.8 Concluding Remarks
6.9 References
6.10 Exercises
7 Scalable Multiprocessors
7.1 Introduction
7.2 Scalability
7.2.1 Bandwidth Scaling
7.2.2 Latency Scaling
7.2.3 Cost Scaling
7.2.4 Physical scaling
7.2.5 Scaling in a Generic Parallel Architecture
7.3 Realizing Programming Models
7.3.1 Primitive Network Transactions
7.3.2 Shared Address Space
7.3.3 Message Passing
7.3.4 Common challenges
7.3.5 Communication architecture design space
7.4 Physical DMA
7.4.1 A Case Study: nCUBE/2
7.5 User-level Access
7.5.1 Case Study: Thinking Machines CM-5
7.5.2 User Level Handlers
7.6 Dedicated Message Processing
7.6.1 Case Study: Intel Paragon
7.6.2 Case Study: Meiko CS-2
7.7 Shared Physical Address Space
7.7.1 Case study: Cray T3D
7.7.2 Cray T3E
7.7.3 Summary
7.8 Clusters and Networks of Workstations
7.8.1 Case Study: Myrinet SBus Lanai
7.8.2 Case Study: PCI Memory Channel
7.9 Comparison of Communication Performance
7.9.1 Network Transaction Performance
7.9.2 Shared Address Space Operations
7.9.3 Message Passing Operations
7.9.4 Application Level Performance
7.10 Synchronization
7.10.1 Algorithms for Locks
7.10.2 Algorithms for Barriers
7.11 Concluding Remarks
7.12 References
7.13 Exercices
8 Directory-based Cache Coherence
8.1 Introduction
8.2 Scalable Cache Coherence
8.3 Overview of Directory-Based Approaches
8.3.1 Operation of a Simple Directory Scheme
8.3.2 Scaling
8.3.3 Alternatives for Organizing Directories
8.4 Assessing Directory Protocols and Tradeoffs
8.4.1 Data Sharing Patterns for Directory Schemes
8.4.2 Local versus Remote Traffic
8.4.3 Cache Block Size Effects
8.5 Design Challenges for Directory Protocols
8.5.1 Performance
8.5.2 Correctness
8.5.3 Overview of Origin2000 Hardware
8.6 Cache-based Directory Protocols: The Sequent NUMA-Q
8.6.1 Cache Coherence Protocol
8.6.2 Dealing with Correctness Issues
8.6.3 Protocol Extensions
8.6.4 Overview of NUMA-Q Hardware
8.6.5 Protocol Interactions with SMP Node
8.6.6 IQ-Link Implementation
8.6.7 Performance Characteristics
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor
8.7 Performance Parameters and Protocol Performance
8.8 Synchronization
8.8.1 Performance of Synchronization Algorithms
8.8.2 Supporting Atomic Primitives
8.9 Implications for Parallel Software
8.10 Advanced Topics
8.10.1 Reducing Directory Storage Overhead
8.10.2 Hierarchical Coherence
8.11 Concluding Remarks
8.12 References
8.13 Exercises
9 Hardware-Software Tradeoffs
9.1 Introduction
9.2 Relaxed Memory Consistency Models
9.2.1 System Specifications
9.2.2 The Programming Interface
9.2.3 Translation Mechanisms
9.2.4 Consistency Models in Real Systems
9.3 Overcoming Capacity Limitations
9.3.1 Tertiary Caches
9.3.2 Cache-only Memory Architectures (COMA)
9.4 Reducing Hardware Cost
9.4.1 Hardware Access Control with a Decoupled Assist
9.4.4 Access Control through Language and Compiler Support
9.5 Putting it All Together: A Taxonomy and Simple-COMA
9.5.1 Putting it All Together: Simple-COMA and Stache
9.6 Implications for Parallel Software
9.7 Advanced Topics
9.7.1 Flexibility and Address Constraints in CC-NUMA Systems
9.7.2 Implementing Relaxed Memory Consistency in Software
9.8 Concluding Remarks
9.9 References
9.10 Exercises
10 Interconnection Network Design
10.1 Introduction
10.1.1 Basic definitions
10.1.2 Basic communication performance
10.2 Organizational Structure
10.2.1 Links
10.2.2 Switches
10.2.3 Network Interfaces
10.3 Interconnection Topologies
10.3.1 Fully connected network
10.3.2 Linear arrays and rings
10.3.3 Multidimensional meshes and tori
10.3.4 Trees
10.3.5 Butterflies
10.3.6 Hypercubes
10.4 Evaluating Design Trade-offs in Network Topology
10.4.1 Unloaded Latency
10.4.2 Latency under load
10.5 Routing
10.5.1 Routing Mechanisms
10.5.2 Deterministic Routing
10.5.3 Deadlock Freedom
10.5.4 Virtual Channels
10.5.5 Up*-Down* Routing
10.5.6 Turn-model Routing
10.5.7 Adaptive Routing
10.6 Switch Design
10.6.1 Ports
10.6.2 Internal Datapath
10.6.3 Channel Buffers
10.6.4 Output Scheduling
10.6.5 Stacked Dimension Switches
10.7 Flow Control
10.7.1 Parallel Computer Networks vs. LANs and WANs
10.7.2 Link-level flow control
10.7.3 End-to-end flow control
10.8 Case Studies
10.8.1 Cray T3D Network
10.8.2 IBM SP-1, SP-2 Network
10.8.3 Scalable Coherent Interconnect
10.8.4 Myricom Network
10.9 Concluding Remarks
10.10 References
10.11 Exercises
11 Latency Tolerance
11.1 Introduction
11.2 Overview of Latency Tolerance
11.2.1 Latency Tolerance and the Communication Pipeline
11.2.2 Approaches
11.2.3 Fundamental Requirements, Benefits and Limitations
11.3 Latency Tolerance in Explicit Message Passing
11.3.1 Structure of Communication
11.3.2 Block Data Transfer
11.3.3 Precommunication
11.3.4 Proceeding Past Communication in the Same Thread
11.3.5 Multithreading
11.4 Latency Tolerance in a Shared Address Space
11.4.1 Structure of Communication
11.5 Block Data Transfer in a Shared Address Space
11.5.1 Techniques and Mechanisms
11.5.2 Policy Issues and Tradeoffs
11.5.3 Performance Benefits
11.6 Proceeding Past Long-latency Events in a Shared Address Space
11.6.1 Proceeding Past Writes
11.6.2 Proceeding Past Reads
11.6.3 Implementation Issues
11.6.4 Summary
11.7 Precommunication in a Shared Address Space
11.7.1 Shared Address Space With No Caching of Shared Data
11.7.2 Cache-coherent Shared Address Space
11.7.3 Performance Benefits
11.7.4 Implementation Issues
11.7.5 Summary
11.8 Multithreading in a Shared Address Space
11.8.1 Techniques and Mechanisms
11.8.2 Performance Benefits
11.8.3 Implementation Issues for the Blocked Scheme
11.8.4 Implementation Issues for the Interleaved Scheme
11.8.5 Integrating Multithreading with Multiple-Issue Processors
11.9 Lockup-free Cache Design
11.10 Concluding Remarks
11.11 References
11.12 Exercises
12 Future Directions
12.1 Technology and Architecture
12.1.1 Evolutionary Scenario
12.1.2 Hitting a Wall
12.1.3 Potential Breakthroughs
12.2 Applications and System Software
12.2.1 Evolution
12.2.2 Hitting a Wall
12.2.3 Potential Breakthroughs
12.3 References
APPENDIX A Parallel Benchmark Suites
A.1 ScaLapack
A.2 TPC
A.3 SPLASH
A.4 NAS Parallel Benchmarks
A.5 PARKBENCH
A.6 Other Ongoing Efforts
No. of pages: 1056
Language: English
Edition: 1
Published: August 1, 1998
Imprint: Morgan Kaufmann
Hardback ISBN: 9781558603431
eBook ISBN: 9780080573076
DC
David Culler
David Culler led the Berkeley
Network of Workstations (NOW) project, which sparked the
current commercial revolution in high-performance clusters.
Anoop Gupta co-led the Stanford DASH multiprocessor project,
which developed the shared-memory technology increasingly
used in commercial machines.
Affiliations and expertise
University of California, Berkeley
JS
Jaswinder Pal Singh
Jaswinder Pal Singh led the
development of the SPLASH and SPLASH-2 suites of parallel
programs, which have defined the workloads and methodology
used to drive decisions and evaluate trade-offs in shared-
memory parallel architecture.
Affiliations and expertise
Princeton University
AG
Anoop Gupta
Dr. Anoop Gupta is a Distinguished Scientist at Microsoft Research. He works on cross-disciplinary projects that have potential for large business or societal impact. His recent projects focus on areas of education, communication, collaboration, and natural user interfaces.