Parallel Computer Architecture book cover

Parallel Computer Architecture

A Hardware/Software Approach

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.

Hardbound, 1056 Pages

Published: August 1998

Imprint: Morgan Kaufmann

ISBN: 978-1-55860-343-1

Contents

    • 1 Introduction
      • 1.1 Introduction
      • 1.2 Why Parallel Architecture
        • 1.2.1 Application Trends
        • 1.2.2 Technology Trends
        • 1.2.3 Architectural Trends
        • 1.2.4 Supercomputers
        • 1.2.5 Summary
      • 1.3 Convergence of Parallel Architectures
        • 1.3.1 Communication Architecture
        • 1.3.2 Shared Memory
        • 1.3.3 Message-Passing
        • 1.3.4 Convergence
        • 1.3.5 Data Parallel Processing
        • 1.3.6 Other Parallel Architectures
        • 1.3.7 A Generic Parallel Architecture
      • 1.4 Fundamental Design Issues
        • 1.4.1 Communication Abstraction
        • 1.4.2 Programming Model Requirements
        • 1.4.3 Naming
        • 1.4.4 Ordering
        • 1.4.5 Communication and Replication
        • 1.4.6 Performance
      • 1.5 Concluding Remarks
      • 1.6 References
      • 1.7 Exercises
    • 2 Parallel Programs
      • 2.1 Introduction
      • 2.2 Parallel Application Case Studies
        • 2.2.1 Simulating Ocean Currents
        • 2.2.2 Simulating the Evolution of Galaxies
        • 2.2.3 Visualizing Complex Scenes using Ray Tracing
        • 2.2.4 Mining Data for Associations
      • 2.3 The Parallelization Process
        • 2.3.1 Steps in the Process
        • 2.3.2 Parallelizing Computation versus Data
        • 2.3.3 Goals of the Parallelization Process
      • 2.4 Parallelization of an Example Program
        • 2.4.1 A Simple Example: The Equation Solver Kernel
        • 2.4.2 Decomposition
        • 2.4.3 Assignment
        • 2.4.4 Orchestration under the Data Parallel Model
        • 2.4.5 Orchestration under the Shared Address Space Model
        • 2.4.6 Orchestration under the Message Passing Model
      • 2.5 Concluding Remarks
      • 2.6 References
      • 2.7 Exercises
    • 3 Programming for Performance
      • 3.1 Introduction
      • 3.2 Partitioning for Performance
        • 3.2.1 Load Balance and Synchronization Wait Time
        • 3.2.2 Reducing Inherent Communication
        • 3.2.3 Reducing the Extra Work
        • 3.2.4 Summary
      • 3.3 Data Access and Communication in a Multi-Memory System
        • 3.3.1 A Multiprocessor as an Extended Memory Hierarchy
        • 3.3.2 Artifactual Communication in the Extended Memory Hierarchy
      • 3.4 Orchestration for Performance
        • 3.4.1 Reducing Artifactual Communication
        • 3.4.2 Structuring Communication to Reduce Cost
      • 3.5 Performance Factors from the ProcessorsÕ Perspective
      • 3.6 The Parallel Application Case Studies: An In-Depth Look
        • 3.6.1 Ocean
        • 3.6.2 Barnes-Hut
        • 3.6.3 Raytrace
        • 3.6.4 Data Mining
      • 3.7 Implications for Programming Models
      • 3.8 Concluding Remarks
      • 3.9 References
      • 3.10 Exercises
    • 4 Workload-Driven Evaluation
      • 4.1 Introduction
      • 4.2 Scaling Workloads and Machines
        • 4.2.1 Why Worry about Scaling?
        • 4.2.2 Key Issues in Scaling
        • 4.2.3 Scaling Models
        • 4.2.4 Scaling Workload Parameters
      • 4.3 Evaluating a Real Machine
        • 4.3.1 Performance Isolation using Microbenchmarks
        • 4.3.2 Choosing Workloads
        • 4.3.3 Evaluating a Fixed-Size Machine
        • 4.3.4 Varying Machine Size
        • 4.3.5 Choosing Performance Metrics
        • 4.3.6 Presenting Results
      • 4.4 Evaluating an Architectural Idea or Tradeoff
        • 4.4.1 Multiprocessor Simulation
        • 4.4.2 Scaling Down Problem and Machine Parameters for Simulation
        • 4.4.3 Dealing with the Parameter Space: An Example Evaluation
        • 4.4.4 Summary
      • 4.5 Illustrating Workload Characterization
        • 4.5.1 Workload Case Studies
        • 4.5.2 Workload Characteristics
      • 4.6 Concluding Remarks
      • 4.7 References
      • 4.8 Exercises
    • 5 Shared Memory Multiprocessors
      • 5.1 Introduction
      • 5.2 Cache Coherence
        • 5.2.1 The Cache Coherence Problem
        • 5.2.2 Cache Coherence Through Bus Snooping
      • 5.3 Memory Consistency
        • 5.3.1 Sequential Consistency
        • 5.3.2 Sufficient Conditions for Preserving Sequential Consistency
      • 5.4 Design Space for Snooping Protocols
        • 5.4.1 A 3-state (MSI) Write-back Invalidation Protocol
        • 5.4.2 A 4-state (MESI) Write-Back Invalidation Protocol
        • 5.4.3 A 4-state (Dragon) Write-back Update Protocol
      • 5.5 Assessing Protocol Design Tradeoffs
        • 5.5.1 Workloads
        • 5.5.2 Impact of Protocol Optimizations
        • 5.5.3 Tradeoffs in Cache Block Size
        • 5.5.4 Update-based vs. Invalidation-based Protocols
      • 5.6 Synchronization
        • 5.6.1 Components of a Synchronization Event
        • 5.6.2 Role of User, System Software and Hardware
        • 5.6.3 Mutual Exclusion
        • 5.6.4 Point-to-point Event Synchronization
        • 5.6.5 Global (Barrier) Event Synchronization
        • 5.6.6 Synchronization Summary
      • 5.7 Implications for Software
      • 5.8 Concluding Remarks
      • 5.9 References
      • 5.10 Exercises
    • 6 Snoop-based Multiprocessor Design
      • 6.1 Introduction
      • 6.2 Correctness Requirements
      • 6.3 Base Design: Single-level Caches with an Atomic Bus
        • 6.3.1 Cache controller and tags
        • 6.3.2 Reporting snoop results
        • 6.3.3 Dealing with Writebacks
        • 6.3.4 Base Organization
        • 6.3.5 Non-atomic State Transitions
        • 6.3.6 Serialization
        • 6.3.7 Deadlock
        • 6.3.8 Livelock and Starvation
        • 6.3.9 Implementing Atomic Primitives
      • 6.4 Multi-level Cache Hierarchies
        • 6.4.1 Maintaining inclusion
        • 6.4.2 Propagating transactions for coherence in the hierarchy
      • 6.5 Split-transaction Bus
        • 6.5.1 An Example Split-transaction Design
        • 6.5.2 Bus Design and Request-response Matching
        • 6.5.3 Snoop Results and Conflicting Requests
        • 6.5.4 Flow Control
        • 6.5.5 Path of a Cache Miss
        • 6.5.6 Serialization and Sequential Consistency
        • 6.5.7 Alternative design choices
        • 6.5.8 Putting it together with multi-level caches
        • 6.5.9 Supporting Multiple Outstanding Misses from a Processor
      • 6.6 Case Studies: SGI Challenge and Sun Enterprise SMPs
        • 6.6.1 SGI Powerpath-2 System Bus
        • 6.6.2 SGI Processor and Memory Subsystems
        • 6.6.3 SGI I/O Subsystem
        • 6.6.4 SGI Challenge Memory System Performance
        • 6.6.5 Sun Gigaplane System Bus
        • 6.6.6 Sun Processor and Memory Subsystem
        • 6.6.7 Sun I/O Subsystem
        • 6.6.8 Sun Enterprise Memory System Performance
        • 6.6.9 Application Performance
      • 6.7 Extending Cache Coherence
        • 6.7.1 Shared-Cache Designs
        • 6.7.2 Coherence for Virtually Indexed Caches
        • 6.7.3 Translation Lookaside Buffer Coherence
        • 6.7.4 Cache coherence on Rings
        • 6.7.5 Scaling Data Bandwidth and Snoop Bandwidth
      • 6.8 Concluding Remarks
      • 6.9 References
      • 6.10 Exercises
    • 7 Scalable Multiprocessors
      • 7.1 Introduction
      • 7.2 Scalability
        • 7.2.1 Bandwidth Scaling
        • 7.2.2 Latency Scaling
        • 7.2.3 Cost Scaling
        • 7.2.4 Physical scaling
        • 7.2.5 Scaling in a Generic Parallel Architecture
      • 7.3 Realizing Programming Models
        • 7.3.1 Primitive Network Transactions
        • 7.3.2 Shared Address Space
        • 7.3.3 Message Passing
        • 7.3.4 Common challenges
        • 7.3.5 Communication architecture design space
      • 7.4 Physical DMA
        • 7.4.1 A Case Study: nCUBE/2
      • 7.5 User-level Access
        • 7.5.1 Case Study: Thinking Machines CM-5
        • 7.5.2 User Level Handlers
      • 7.6 Dedicated Message Processing
        • 7.6.1 Case Study: Intel Paragon
        • 7.6.2 Case Study: Meiko CS-2
      • 7.7 Shared Physical Address Space
        • 7.7.1 Case study: Cray T3D
        • 7.7.2 Cray T3E
        • 7.7.3 Summary
      • 7.8 Clusters and Networks of Workstations
        • 7.8.1 Case Study: Myrinet SBus Lanai
        • 7.8.2 Case Study: PCI Memory Channel
      • 7.9 Comparison of Communication Performance
        • 7.9.1 Network Transaction Performance
        • 7.9.2 Shared Address Space Operations
        • 7.9.3 Message Passing Operations
        • 7.9.4 Application Level Performance
      • 7.10 Synchronization
        • 7.10.1 Algorithms for Locks
        • 7.10.2 Algorithms for Barriers
      • 7.11 Concluding Remarks
      • 7.12 References
      • 7.13 Exercices
    • 8 Directory-based Cache Coherence
      • 8.1 Introduction
      • 8.2 Scalable Cache Coherence
      • 8.3 Overview of Directory-Based Approaches
        • 8.3.1 Operation of a Simple Directory Scheme
        • 8.3.2 Scaling
        • 8.3.3 Alternatives for Organizing Directories
      • 8.4 Assessing Directory Protocols and Tradeoffs
        • 8.4.1 Data Sharing Patterns for Directory Schemes
        • 8.4.2 Local versus Remote Traffic
        • 8.4.3 Cache Block Size Effects
      • 8.5 Design Challenges for Directory Protocols
        • 8.5.1 Performance
        • 8.5.2 Correctness
        • 8.5.3 Overview of Origin2000 Hardware
      • 8.6 Cache-based Directory Protocols: The Sequent NUMA-Q
        • 8.6.1 Cache Coherence Protocol
        • 8.6.2 Dealing with Correctness Issues
        • 8.6.3 Protocol Extensions
        • 8.6.4 Overview of NUMA-Q Hardware
        • 8.6.5 Protocol Interactions with SMP Node
        • 8.6.6 IQ-Link Implementation
        • 8.6.7 Performance Characteristics
        • 8.6.8 Comparison Case Study: The HAL S1 Multiprocessor
      • 8.7 Performance Parameters and Protocol Performance
      • 8.8 Synchronization
        • 8.8.1 Performance of Synchronization Algorithms
        • 8.8.2 Supporting Atomic Primitives
      • 8.9 Implications for Parallel Software
      • 8.10 Advanced Topics
        • 8.10.1 Reducing Directory Storage Overhead
        • 8.10.2 Hierarchical Coherence
      • 8.11 Concluding Remarks
      • 8.12 References
      • 8.13 Exercises
    • 9 Hardware-Software Tradeoffs
      • 9.1 Introduction
      • 9.2 Relaxed Memory Consistency Models
        • 9.2.1 System Specifications
        • 9.2.2 The Programming Interface
        • 9.2.3 Translation Mechanisms
        • 9.2.4 Consistency Models in Real Systems
      • 9.3 Overcoming Capacity Limitations
        • 9.3.1 Tertiary Caches
        • 9.3.2 Cache-only Memory Architectures (COMA)
      • 9.4 Reducing Hardware Cost
        • 9.4.1 Hardware Access Control with a Decoupled Assist
        • 9.4.2 Access Control through Code Instrumentation
        • 9.4.3 Page-based Access Control: Shared Virtual Memory
        • 9.4.4 Access Control through Language and Compiler Support
      • 9.5 Putting it All Together: A Taxonomy and Simple-COMA
        • 9.5.1 Putting it All Together: Simple-COMA and Stache
      • 9.6 Implications for Parallel Software
      • 9.7 Advanced Topics
        • 9.7.1 Flexibility and Address Constraints in CC-NUMA Systems
        • 9.7.2 Implementing Relaxed Memory Consistency in Software
      • 9.8 Concluding Remarks
      • 9.9 References
      • 9.10 Exercises
    • 10 Interconnection Network Design
      • 10.1 Introduction
        • 10.1.1 Basic definitions
        • 10.1.2 Basic communication performance
      • 10.2 Organizational Structure
        • 10.2.1 Links
        • 10.2.2 Switches
        • 10.2.3 Network Interfaces
      • 10.3 Interconnection Topologies
        • 10.3.1 Fully connected network
        • 10.3.2 Linear arrays and rings
        • 10.3.3 Multidimensional meshes and tori
        • 10.3.4 Trees
        • 10.3.5 Butterflies
        • 10.3.6 Hypercubes
      • 10.4 Evaluating Design Trade-offs in Network Topology
        • 10.4.1 Unloaded Latency
        • 10.4.2 Latency under load
      • 10.5 Routing
        • 10.5.1 Routing Mechanisms
        • 10.5.2 Deterministic Routing
        • 10.5.3 Deadlock Freedom
        • 10.5.4 Virtual Channels
        • 10.5.5 Up*-Down* Routing
        • 10.5.6 Turn-model Routing
        • 10.5.7 Adaptive Routing
      • 10.6 Switch Design
        • 10.6.1 Ports
        • 10.6.2 Internal Datapath
        • 10.6.3 Channel Buffers
        • 10.6.4 Output Scheduling
        • 10.6.5 Stacked Dimension Switches
      • 10.7 Flow Control
        • 10.7.1 Parallel Computer Networks vs. LANs and WANs
        • 10.7.2 Link-level flow control
        • 10.7.3 End-to-end flow control
      • 10.8 Case Studies
        • 10.8.1 Cray T3D Network
        • 10.8.2 IBM SP-1, SP-2 Network
        • 10.8.3 Scalable Coherent Interconnect
        • 10.8.4 Myricom Network
      • 10.9 Concluding Remarks
      • 10.10 References
      • 10.11 Exercises
    • 11 Latency Tolerance
      • 11.1 Introduction
      • 11.2 Overview of Latency Tolerance
        • 11.2.1 Latency Tolerance and the Communication Pipeline
        • 11.2.2 Approaches
        • 11.2.3 Fundamental Requirements, Benefits and Limitations
      • 11.3 Latency Tolerance in Explicit Message Passing
        • 11.3.1 Structure of Communication
        • 11.3.2 Block Data Transfer
        • 11.3.3 Precommunication
        • 11.3.4 Proceeding Past Communication in the Same Thread
        • 11.3.5 Multithreading
      • 11.4 Latency Tolerance in a Shared Address Space
        • 11.4.1 Structure of Communication
      • 11.5 Block Data Transfer in a Shared Address Space
        • 11.5.1 Techniques and Mechanisms
        • 11.5.2 Policy Issues and Tradeoffs
        • 11.5.3 Performance Benefits
      • 11.6 Proceeding Past Long-latency Events in a Shared Address Space
        • 11.6.1 Proceeding Past Writes
        • 11.6.2 Proceeding Past Reads
        • 11.6.3 Implementation Issues
        • 11.6.4 Summary
      • 11.7 Precommunication in a Shared Address Space
        • 11.7.1 Shared Address Space With No Caching of Shared Data
        • 11.7.2 Cache-coherent Shared Address Space
        • 11.7.3 Performance Benefits
        • 11.7.4 Implementation Issues
        • 11.7.5 Summary
      • 11.8 Multithreading in a Shared Address Space
        • 11.8.1 Techniques and Mechanisms
        • 11.8.2 Performance Benefits
        • 11.8.3 Implementation Issues for the Blocked Scheme
        • 11.8.4 Implementation Issues for the Interleaved Scheme
        • 11.8.5 Integrating Multithreading with Multiple-Issue Processors
      • 11.9 Lockup-free Cache Design
      • 11.10 Concluding Remarks
      • 11.11 References
      • 11.12 Exercises
    • 12 Future Directions
      • 12.1 Technology and Architecture
        • 12.1.1 Evolutionary Scenario
        • 12.1.2 Hitting a Wall
        • 12.1.3 Potential Breakthroughs
      • 12.2 Applications and System Software
        • 12.2.1 Evolution
        • 12.2.2 Hitting a Wall
        • 12.2.3 Potential Breakthroughs
      • 12.3 References
    • APPENDIX A Parallel Benchmark Suites
      • A.1 ScaLapack
      • A.2 TPC
      • A.3 SPLASH
      • A.4 NAS Parallel Benchmarks
      • A.5 PARKBENCH
      • A.6 Other Ongoing Efforts

Advertisement

advert image