Programming Massively Parallel Processors

A Hands-on Approach

Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth.

This best-selling guide to CUDA and GPU parallel programming has been revised with more parallel programming examples, commonly used libraries such as Thrust, and explanations of the latest tools. With these improvements, the book retains its concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.


Audience: Advanced students, software engineers, programmers, hardware engineers

Paperback, 514 Pages

Published: December 2012

Imprint: Morgan Kaufmann

ISBN: 978-0-12-415992-1


  • "For those interested in the GPU path to parallel enlightenment, this new book from David Kirk and Wen-mei Hwu is a godsend, as it introduces CUDA™, a C-like data parallel language, and Tesla™, the architecture of the current generation of NVIDIA GPUs. In addition to explaining the language and the architecture, they define the nature of data parallel problems that run well on the heterogeneous CPU-GPU hardware ... This book is a valuable addition to the recently reinvigorated parallel computing literature."
    - David Patterson, Director of The Parallel Computing Research Laboratory and the Pardee Professor of Computer Science, U.C. Berkeley. Co-author of Computer Architecture: A Quantitative Approach

    "Written by two teaching pioneers, this book is the definitive practical reference on programming massively parallel processors--a true technological gold mine. The hands-on learning included is cutting-edge, yet very readable. This is a most rewarding read for students, engineers, and scientists interested in supercharging computational resources to solve today's and tomorrow's hardest problems."
    - Nicolas Pinto, MIT, NVIDIA Fellow, 2009

    "I have always admired Wen-mei Hwu's and David Kirk's ability to turn complex problems into easy-to-comprehend concepts. They have done it again in this book. This joint venture of a passionate teacher and a GPU evangelizer tackles the trade-off between the simple explanation of the concepts and the in-depth analysis of the programming techniques. This is a great book to learn both massive parallel programming and CUDA."
    - Mateo Valero, Director, Barcelona Supercomputing Center

    "The use of GPUs is having a big impact in scientific computing. David Kirk and Wen-mei Hwu's new book is an important contribution towards educating our students on the ideas and techniques of programming for massively parallel processors."
    - Mike Giles, Professor of Scientific Computing, University of Oxford

    "This book is the most comprehensive and authoritative introduction to GPU computing yet. David Kirk and Wen-mei Hwu are the pioneers in this increasingly important field, and their insights are invaluable and fascinating. This book will be the standard reference for years to come."
    - Hanspeter Pfister, Harvard University

    "This is a vital and much-needed text. GPU programming is growing by leaps and bounds. This new book will be very welcomed and highly useful across inter-disciplinary fields."
    - Shannon Steinfadt, Kent State University

    "GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors." –


  • CHAPTER 1 Introduction
    1.1 Heterogeneous Parallel Computing
    1.2 Architecture of a Modern GPU
    1.3 Why More Speed or Parallelism?
    1.4 Speeding Up Real Applications
    1.5 Parallel Programming Languages and Models
    1.6 Overarching Goals
    1.7 Organization of the Book

    CHAPTER 2 History of GPU Computing
    2.1 Evolution of Graphics Pipelines
    2.2 GPGPU: An Intermediate Step
    2.3 GPU Computing

    CHAPTER 3 Introduction to Data Parallelism and CUDA C
    3.1 Data Parallelism
    3.2 CUDA Program Structure
    3.3 A Vector Addition Kernel
    3.4 Device Global Memory and Data Transfer
    3.5 Kernel Functions and Threading
    3.6 Summary
    3.7 Exercises

    CHAPTER 4 Data-Parallel Execution Model
    4.1 CUDA Thread Organization
    4.2 Mapping Threads to Multidimensional Data
    4.3 Matrix-Matrix Multiplication—A More Complex Kernel
    4.4 Synchronization and Transparent Scalability
    4.5 Assigning Resources to Blocks
    4.6 Querying Device Properties
    4.7 Thread Scheduling and Latency Tolerance
    4.8 Summary
    4.9 Exercises

    CHAPTER 5 CUDA Memories
    5.1 Importance of Memory Access Efficiency
    5.2 CUDA Device Memory Types
    5.3 A Strategy for Reducing Global Memory Traffic
    5.4 A Tiled Matrix-Matrix Multiplication Kernel
    5.5 Memory as a Limiting Factor to Parallelism
    5.6 Summary
    5.7 Exercises

    CHAPTER 6 Performance Considerations
    6.1 Warps and Thread Execution
    6.2 Global Memory Bandwidth
    6.3 Dynamic Partitioning of Execution Resources
    6.4 Instruction Mix and Thread Granularity
    6.5 Summary
    6.6 Exercises

    CHAPTER 7 Floating-Point Considerations
    7.1 Floating-Point Format
    7.2 Representable Numbers
    7.3 Special Bit Patterns and Precision in IEEE Format
    7.4 Arithmetic Accuracy and Rounding
    7.5 Algorithm Considerations
    7.6 Numerical Stability
    7.7 Summary
    7.8 Exercises

    CHAPTER 8 Parallel Patterns: Convolution
    8.1 Background
    8.2 1D Parallel Convolution—A Basic Algorithm
    8.3 Constant Memory and Caching
    8.4 Tiled 1D Convolution with Halo Elements
    8.5 A Simpler Tiled 1D Convolution—General Caching
    8.6 Summary
    8.7 Exercises

    CHAPTER 9 Parallel Patterns: Prefix Sum
    9.1 Background
    9.2 A Simple Parallel Scan
    9.3 Work Efficiency Considerations
    9.4 A Work-Efficient Parallel Scan
    9.5 Parallel Scan for Arbitrary-Length Inputs
    9.6 Summary
    9.7 Exercises

    CHAPTER 10 Parallel Patterns: Sparse Matrix-Vector Multiplication
    10.1 Background
    10.2 Parallel SpMV Using CSR
    10.3 Padding and Transposition
    10.4 Using Hybrid to Control Padding
    10.5 Sorting and Partitioning for Regularization
    10.6 Summary
    10.7 Exercises

    CHAPTER 11 Application Case Study: Advanced MRI Reconstruction
    11.1 Application Background
    11.2 Iterative Reconstruction
    11.3 Computing FHD
    11.4 Final Evaluation
    11.5 Exercises

    CHAPTER 12 Application Case Study: Molecular Visualization and Analysis
    12.1 Application Background
    12.2 A Simple Kernel Implementation
    12.3 Thread Granularity Adjustment
    12.4 Memory Coalescing
    12.5 Summary
    12.6 Exercises

    CHAPTER 13 Parallel Programming and Computational Thinking
    13.1 Goals of Parallel Computing
    13.2 Problem Decomposition
    13.3 Algorithm Selection
    13.4 Computational Thinking
    13.5 Summary
    13.6 Exercises

    CHAPTER 14 An Introduction to OpenCL
    14.1 Background
    14.2 Data Parallelism Model
    14.3 Device Architecture
    14.4 Kernel Functions
    14.5 Device Management and Kernel Launch
    14.6 Electrostatic Potential Map in OpenCL
    14.7 Summary
    14.8 Exercises

    CHAPTER 15 Parallel Programming with OpenACC
    15.1 OpenACC Versus CUDA C
    15.2 Execution Model
    15.3 Memory Model
    15.4 Basic OpenACC Programs
    15.5 Future Directions of OpenACC
    15.6 Exercises

    CHAPTER 16 Thrust: A Productivity-Oriented Library for CUDA
    16.1 Background
    16.2 Motivation
    16.3 Basic Thrust Features
    16.4 Generic Programming
    16.5 Benefits of Abstraction
    16.6 Programmer Productivity
    16.7 Best Practices
    16.8 Exercises

    CHAPTER 17 CUDA FORTRAN
    17.1 CUDA FORTRAN and CUDA C Differences
    17.2 A First CUDA FORTRAN Program
    17.3 Multidimensional Array in CUDA FORTRAN
    17.4 Overloading Host/Device Routines With Generic
    17.5 Calling CUDA C Via Iso_C_Binding
    17.6 Kernel Loop Directives and Reduction Operations
    17.7 Dynamic Shared Memory
    17.8 Asynchronous Data Transfers
    17.9 Compilation and Profiling
    17.10 Calling Thrust from CUDA FORTRAN
    17.11 Exercises

    CHAPTER 18 An Introduction to C++ AMP
    18.1 Core C++ AMP Features
    18.2 Details of the C++ AMP Execution Model
    18.3 Managing Accelerators
    18.4 Tiled Execution
    18.5 C++ AMP Graphics Features
    18.6 Summary
    18.7 Exercises

    CHAPTER 19 Programming a Heterogeneous Computing Cluster
    19.1 Background
    19.2 A Running Example
    19.3 MPI Basics
    19.4 MPI Point-to-Point Communication Types
    19.5 Overlapping Computation and Communication
    19.6 MPI Collective Communication
    19.7 Summary
    19.8 Exercises

    CHAPTER 20 CUDA Dynamic Parallelism
    20.1 Background
    20.2 Dynamic Parallelism Overview
    20.3 Important Details
    20.4 Memory Visibility
    20.5 A Simple Example
    20.6 Runtime Limitations
    20.7 A More Complex Example
    20.8 Summary

    CHAPTER 21 Conclusion and Future Outlook
    21.1 Goals Revisited
    21.2 Memory Model Evolution
    21.3 Kernel Execution Control Evolution
    21.4 Core Performance
    21.5 Programming Environment
    21.6 Future Outlook

    Appendix A: Matrix Multiplication Host-Only Version Source Code
    Appendix B: GPU Compute Capabilities
