Programming Massively Parallel Processors

A Hands-on Approach

First published on December 14, 2012

  • Authors: David Kirk, Wen-mei W. Hwu
  • eBook ISBN: 9780123914187


Programming Massively Parallel Processors: A Hands-on Approach, Second Edition, teaches students how to program massively parallel processors. It discusses in detail various techniques for constructing parallel programs. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. The guide shows students and professionals alike the basic concepts of parallel programming and GPU architecture, and it covers performance, floating-point format, parallel patterns, and dynamic parallelism in depth. This revised edition contains more parallel programming examples, commonly used libraries such as Thrust, and explanations of the latest tools. It also provides new coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more; increased coverage of the related technology OpenCL, along with new material on algorithm patterns, GPU clusters, host programming, and data parallelism; and two new case studies (on MRI reconstruction and molecular visualization) that explore the latest applications of CUDA and GPUs for scientific research and high-performance computing. This book should be a valuable resource for advanced students, software engineers, programmers, and hardware engineers.

Key Features

  • New coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more
  • Increased coverage of the related technology OpenCL, and new material on algorithm patterns, GPU clusters, host programming, and data parallelism
  • Two new case studies (on MRI reconstruction and molecular visualization) explore the latest applications of CUDA and GPUs for scientific research and high-performance computing


Readership

Advanced students, software engineers, programmers, hardware engineers

Table of Contents

  • Preface

    Target Audience

    How to Use the Book

    Online Supplements



    Chapter 1. Introduction

    1.1 Heterogeneous Parallel Computing

    1.2 Architecture of a Modern GPU

    1.3 Why More Speed or Parallelism?

    1.4 Speeding Up Real Applications

    1.5 Parallel Programming Languages and Models

    1.6 Overarching Goals

    1.7 Organization of the Book


    Chapter 2. History of GPU Computing

    2.1 Evolution of Graphics Pipelines

    2.2 GPGPU: An Intermediate Step

    2.3 GPU Computing

    References and Further Reading

    Chapter 3. Introduction to Data Parallelism and CUDA C

    3.1 Data Parallelism

    3.2 CUDA Program Structure

    3.3 A Vector Addition Kernel

    3.4 Device Global Memory and Data Transfer

    3.5 Kernel Functions and Threading

    3.6 Summary

    3.7 Exercises


    Chapter 4. Data-Parallel Execution Model

    4.1 CUDA Thread Organization

    4.2 Mapping Threads to Multidimensional Data

    4.3 Matrix-Matrix Multiplication—A More Complex Kernel

    4.4 Synchronization and Transparent Scalability

    4.5 Assigning Resources to Blocks

    4.6 Querying Device Properties

    4.7 Thread Scheduling and Latency Tolerance

    4.8 Summary

    4.9 Exercises

    Chapter 5. CUDA Memories

    5.1 Importance of Memory Access Efficiency

    5.2 CUDA Device Memory Types

    5.3 A Strategy for Reducing Global Memory Traffic

    5.4 A Tiled Matrix–Matrix Multiplication Kernel

    5.5 Memory as a Limiting Factor to Parallelism

    5.6 Summary

    5.7 Exercises

    Chapter 6. Performance Considerations

    6.1 Warps and Thread Execution

    6.2 Global Memory Bandwidth

    6.3 Dynamic Partitioning of Execution Resources

    6.4 Instruction Mix and Thread Granularity

    6.5 Summary

    6.6 Exercises


    Chapter 7. Floating-Point Considerations

    7.1 Floating-Point Format

    7.2 Representable Numbers

    7.3 Special Bit Patterns and Precision in IEEE Format

    7.4 Arithmetic Accuracy and Rounding

    7.5 Algorithm Considerations

    7.6 Numerical Stability

    7.7 Summary

    7.8 Exercises


    Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches

    8.1 Background

    8.2 1D Parallel Convolution—A Basic Algorithm

    8.3 Constant Memory and Caching

    8.4 Tiled 1D Convolution with Halo Elements

    8.5 A Simpler Tiled 1D Convolution—General Caching

    8.6 Summary

    8.7 Exercises

    Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms

    9.1 Background

    9.2 A Simple Parallel Scan

    9.3 Work Efficiency Considerations

    9.4 A Work-Efficient Parallel Scan

    9.5 Parallel Scan for Arbitrary-Length Inputs

    9.6 Summary

    9.7 Exercises


    Chapter 10. Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms

    10.1 Background

    10.2 Parallel SpMV Using CSR

    10.3 Padding and Transposition

    10.4 Using Hybrid to Control Padding

    10.5 Sorting and Partitioning for Regularization

    10.6 Summary

    10.7 Exercises


    Chapter 11. Application Case Study: Advanced MRI Reconstruction

    11.1 Application Background

    11.2 Iterative Reconstruction

    11.3 Computing FHD

    11.4 Final Evaluation

    11.5 Exercises


    Chapter 12. Application Case Study: Molecular Visualization and Analysis

    12.1 Application Background

    12.2 A Simple Kernel Implementation

    12.3 Thread Granularity Adjustment

    12.4 Memory Coalescing

    12.5 Summary

    12.6 Exercises


    Chapter 13. Parallel Programming and Computational Thinking

    13.1 Goals of Parallel Computing

    13.2 Problem Decomposition

    13.3 Algorithm Selection

    13.4 Computational Thinking

    13.5 Summary

    13.6 Exercises


    Chapter 14. An Introduction to OpenCL™

    14.1 Background

    14.2 Data Parallelism Model

    14.3 Device Architecture

    14.4 Kernel Functions

    14.5 Device Management and Kernel Launch

    14.6 Electrostatic Potential Map in OpenCL

    14.7 Summary

    14.8 Exercises


    Chapter 15. Parallel Programming with OpenACC

    15.1 OpenACC Versus CUDA C

    15.2 Execution Model

    15.3 Memory Model

    15.4 Basic OpenACC Programs

    15.5 Future Directions of OpenACC

    15.6 Exercises

    Chapter 16. Thrust: A Productivity-Oriented Library for CUDA

    16.1 Background

    16.2 Motivation

    16.3 Basic Thrust Features

    16.4 Generic Programming

    16.5 Benefits of Abstraction

    16.6 Programmer Productivity

    16.7 Best Practices

    16.8 Exercises


    Chapter 17. CUDA FORTRAN

    17.1 CUDA FORTRAN and CUDA C Differences

    17.2 A First CUDA FORTRAN Program

    17.3 Multidimensional Array in CUDA FORTRAN

    17.4 Overloading Host/Device Routines With Generic Interfaces

    17.5 Calling CUDA C via iso_c_binding

    17.6 Kernel Loop Directives and Reduction Operations

    17.7 Dynamic Shared Memory

    17.8 Asynchronous Data Transfers

    17.9 Compilation and Profiling

    17.10 Calling Thrust from CUDA FORTRAN

    17.11 Exercises

    Chapter 18. An Introduction to C++ AMP

    18.1 Core C++ AMP Features

    18.2 Details of the C++ AMP Execution Model

    18.3 Managing Accelerators

    18.4 Tiled Execution

    18.5 C++ AMP Graphics Features

    18.6 Summary

    18.7 Exercises

    Chapter 19. Programming a Heterogeneous Computing Cluster

    19.1 Background

    19.2 A Running Example

    19.3 MPI Basics

    19.4 MPI Point-to-Point Communication Types

    19.5 Overlapping Computation and Communication

    19.6 MPI Collective Communication

    19.7 Summary

    19.8 Exercises


    Chapter 20. CUDA Dynamic Parallelism

    20.1 Background

    20.2 Dynamic Parallelism Overview

    20.3 Important Details

    20.4 Memory Visibility

    20.5 A Simple Example

    20.6 Runtime Limitations

    20.7 A More Complex Example

    20.8 Summary


    Chapter 21. Conclusion and Future Outlook

    21.1 Goals Revisited

    21.2 Memory Model Evolution

    21.3 Kernel Execution Control Evolution

    21.4 Core Performance

    21.5 Programming Environment

    21.6 Future Outlook


    Appendix A. Matrix Multiplication Host-Only Version Source Code

    Appendix Outline


    A.2 matrixmul_gold.cpp

    A.3 matrixmul.h

    A.4 assist.h

    A.5 Expected Output

    Appendix B. GPU Compute Capabilities

    Appendix Outline

    B.1 GPU Compute Capability Tables

    B.2 Memory Coalescing Variations


Product details

  • No. of pages: 520
  • Language: English
  • Copyright: © Morgan Kaufmann 2012
  • Published: December 14, 2012
  • Imprint: Morgan Kaufmann
  • eBook ISBN: 9780123914187

About the Authors

David Kirk

David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow.

At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.

Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.

Wen-mei W. Hwu

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group. He is a co-founder and CTO of MulticoreWare. For his contributions in research and teaching, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters Petascale computer project. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.

Affiliations and Expertise

CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA
