Programming Massively Parallel Processors

A Hands-on Approach

4th Edition - May 28, 2022

  • Authors: Wen-mei Hwu, David Kirk, Izzat El Hajj
  • Paperback ISBN: 9780323912310
  • eBook ISBN: 9780323984638

Description

Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail, and case studies demonstrate a development process that begins with computational thinking and ends with effective, efficient parallel programs. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. For this new edition, the authors have updated their coverage of CUDA, including the concept of unified memory, and expanded content in areas such as threads, while retaining the concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.
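
To give a sense of the book's starting point, the sketch below shows a complete CUDA C vector addition program in the style that Chapter 2 develops (kernel, device global memory, data transfer, and kernel launch). It is a minimal illustrative reconstruction, not the book's exact code; names such as vecAddKernel are hypothetical.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Each thread computes one element of the output vector.
    __global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];  // boundary check for the partial final block
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host arrays
        float* hA = (float*)malloc(bytes);
        float* hB = (float*)malloc(bytes);
        float* hC = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        // Device global memory and host-to-device transfer
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %.1f\n", hC[0]);  // expect 3.0

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }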

Key Features

  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
  • Updated to utilize CUDA version 10.0, NVIDIA's software development tool created specifically for massively parallel environments
  • Features new content on unified memory (see the sketch after this list), as well as expanded content on threads, streams, warp divergence, and OpenMP
  • Includes updated and new case studies
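
As a companion to the unified memory bullet above, here is a minimal sketch of the CUDA unified memory model: a single cudaMallocManaged allocation is addressable from both host and device, with the runtime migrating pages on demand. The kernel and its identifiers are illustrative assumptions, not taken from the book.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Scale each element in place; one thread per element.
    __global__ void scaleKernel(float* x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main(void) {
        const int n = 1024;
        float* x;
        // One allocation, visible to both host and device; no explicit cudaMemcpy needed.
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = 1.0f;  // host writes directly

        scaleKernel<<<(n + 255) / 256, 256>>>(x, 3.0f, n);
        cudaDeviceSynchronize();  // wait for the GPU before the host reads x

        printf("x[0] = %.1f\n", x[0]);  // expect 3.0
        cudaFree(x);
        return 0;
    }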

Readership

Upper-level undergraduate through graduate students studying parallel computing in computer science or engineering (according to Navstem, roughly 8,600 students are enrolled annually in such courses in the US); also software engineers, programmers, and hardware engineers.

Table of Contents

  • Cover image
  • Title page
  • Table of Contents
  • Copyright
  • Dedication
  • Foreword
  • Preface
  • How to use the book
  • A two-phased approach
  • Tying it all together: the final project
  • The design document
  • The project report and symposium
  • Class competition
  • Course resources
  • Acknowledgments
  • Chapter 1. Introduction
  • Abstract
  • Chapter Outline
  • 1.1 Heterogeneous parallel computing
  • 1.2 Why more speed or parallelism?
  • 1.3 Speeding up real applications
  • 1.4 Challenges in parallel programming
  • 1.5 Related parallel programming interfaces
  • 1.6 Overarching goals
  • 1.7 Organization of the book
  • References
  • Part I: Fundamental Concepts
  • Chapter 2. Heterogeneous data parallel computing
  • Abstract
  • Chapter Outline
  • 2.1 Data parallelism
  • 2.2 CUDA C program structure
  • 2.3 A vector addition kernel
  • 2.4 Device global memory and data transfer
  • 2.5 Kernel functions and threading
  • 2.6 Calling kernel functions
  • 2.7 Compilation
  • 2.8 Summary
  • Exercises
  • References
  • Chapter 3. Multidimensional grids and data
  • Abstract
  • Chapter Outline
  • 3.1 Multidimensional grid organization
  • 3.2 Mapping threads to multidimensional data
  • 3.3 Image blur: a more complex kernel
  • 3.4 Matrix multiplication
  • 3.5 Summary
  • Exercises
  • Chapter 4. Compute architecture and scheduling
  • Abstract
  • Chapter Outline
  • 4.1 Architecture of a modern GPU
  • 4.2 Block scheduling
  • 4.3 Synchronization and transparent scalability
  • 4.4 Warps and SIMD hardware
  • 4.5 Control divergence
  • 4.6 Warp scheduling and latency tolerance
  • 4.7 Resource partitioning and occupancy
  • 4.8 Querying device properties
  • 4.9 Summary
  • Exercises
  • References
  • Chapter 5. Memory architecture and data locality
  • Abstract
  • Chapter Outline
  • 5.1 Importance of memory access efficiency
  • 5.2 CUDA memory types
  • 5.3 Tiling for reduced memory traffic
  • 5.4 A tiled matrix multiplication kernel
  • 5.5 Boundary checks
  • 5.6 Impact of memory usage on occupancy
  • 5.7 Summary
  • Exercises
  • Chapter 6. Performance considerations
  • Abstract
  • Chapter Outline
  • 6.1 Memory coalescing
  • 6.2 Hiding memory latency
  • 6.3 Thread coarsening
  • 6.4 A checklist of optimizations
  • 6.5 Knowing your computation’s bottleneck
  • 6.6 Summary
  • Exercises
  • References
  • Part II: Parallel Patterns
  • Chapter 7. Convolution: An introduction to constant memory and caching
  • Abstract
  • Chapter Outline
  • 7.1 Background
  • 7.2 Parallel convolution: a basic algorithm
  • 7.3 Constant memory and caching
  • 7.4 Tiled convolution with halo cells
  • 7.5 Tiled convolution using caches for halo cells
  • 7.6 Summary
  • Exercises
  • Chapter 8. Stencil
  • Abstract
  • Chapter Outline
  • 8.1 Background
  • 8.2 Parallel stencil: a basic algorithm
  • 8.3 Shared memory tiling for stencil sweep
  • 8.4 Thread coarsening
  • 8.5 Register tiling
  • 8.6 Summary
  • Exercises
  • Chapter 9. Parallel histogram: An introduction to atomic operations and privatization
  • Abstract
  • Chapter Outline
  • 9.1 Background
  • 9.2 Atomic operations and a basic histogram kernel
  • 9.3 Latency and throughput of atomic operations
  • 9.4 Privatization
  • 9.5 Coarsening
  • 9.6 Aggregation
  • 9.7 Summary
  • Exercises
  • References
  • Chapter 10. Reduction: And minimizing divergence
  • Abstract
  • Chapter Outline
  • 10.1 Background
  • 10.2 Reduction trees
  • 10.3 A simple reduction kernel
  • 10.4 Minimizing control divergence
  • 10.5 Minimizing memory divergence
  • 10.6 Minimizing global memory accesses
  • 10.7 Hierarchical reduction for arbitrary input length
  • 10.8 Thread coarsening for reduced overhead
  • 10.9 Summary
  • Exercises
  • Chapter 11. Prefix sum (scan): An introduction to work efficiency in parallel algorithms
  • Abstract
  • Chapter Outline
  • 11.1 Background
  • 11.2 Parallel scan with the Kogge-Stone algorithm
  • 11.3 Speed and work efficiency consideration
  • 11.4 Parallel scan with the Brent-Kung algorithm
  • 11.5 Coarsening for even more work efficiency
  • 11.6 Segmented parallel scan for arbitrary-length inputs
  • 11.7 Single-pass scan for memory access efficiency
  • 11.8 Summary
  • Exercises
  • References
  • Chapter 12. Merge: An introduction to dynamic input data identification
  • Abstract
  • Chapter Outline
  • 12.1 Background
  • 12.2 A sequential merge algorithm
  • 12.3 A parallelization approach
  • 12.4 Co-rank function implementation
  • 12.5 A basic parallel merge kernel
  • 12.6 A tiled merge kernel to improve coalescing
  • 12.7 A circular buffer merge kernel
  • 12.8 Thread coarsening for merge
  • 12.9 Summary
  • Exercises
  • References
  • Part III: Advanced Patterns and Applications
  • Chapter 13. Sorting
  • Abstract
  • Chapter Outline
  • 13.1 Background
  • 13.2 Radix sort
  • 13.3 Parallel radix sort
  • 13.4 Optimizing for memory coalescing
  • 13.5 Choice of radix value
  • 13.6 Thread coarsening to improve coalescing
  • 13.7 Parallel merge sort
  • 13.8 Other parallel sort methods
  • 13.9 Summary
  • Exercises
  • References
  • Chapter 14. Sparse matrix computation
  • Abstract
  • Chapter Outline
  • 14.1 Background
  • 14.2 A simple SpMV kernel with the COO format
  • 14.3 Grouping row nonzeros with the CSR format
  • 14.4 Improving memory coalescing with the ELL format
  • 14.5 Regulating padding with the hybrid ELL-COO format
  • 14.6 Reducing control divergence with the JDS format
  • 14.7 Summary
  • Exercises
  • References
  • Chapter 15. Graph traversal
  • Abstract
  • Chapter Outline
  • 15.1 Background
  • 15.2 Breadth-first search
  • 15.3 Vertex-centric parallelization of breadth-first search
  • 15.4 Edge-centric parallelization of breadth-first search
  • 15.5 Improving efficiency with frontiers
  • 15.6 Reducing contention with privatization
  • 15.7 Other optimizations
  • 15.8 Summary
  • Exercises
  • References
  • Chapter 16. Deep learning
  • Abstract
  • Chapter Outline
  • 16.1 Background
  • 16.2 Convolutional neural networks
  • 16.3 Convolutional layer: a CUDA inference kernel
  • 16.4 Formulating a convolutional layer as GEMM
  • 16.5 CUDNN library
  • 16.6 Summary
  • Exercises
  • References
  • Chapter 17. Iterative magnetic resonance imaging reconstruction
  • Abstract
  • Chapter Outline
  • 17.1 Background
  • 17.2 Iterative reconstruction
  • 17.3 Computing FHD
  • 17.4 Summary
  • Exercises
  • References
  • Chapter 18. Electrostatic potential map
  • Abstract
  • Chapter Outline
  • 18.1 Background
  • 18.2 Scatter versus gather in kernel design
  • 18.3 Thread coarsening
  • 18.4 Memory coalescing
  • 18.5 Cutoff binning for data size scalability
  • 18.6 Summary
  • Exercises
  • References
  • Chapter 19. Parallel programming and computational thinking
  • Abstract
  • Chapter Outline
  • 19.1 Goals of parallel computing
  • 19.2 Algorithm selection
  • 19.3 Problem decomposition
  • 19.4 Computational thinking
  • 19.5 Summary
  • References
  • Part IV: Advanced Practices
  • Chapter 20. Programming a heterogeneous computing cluster: An introduction to CUDA streams
  • Abstract
  • Chapter Outline
  • 20.1 Background
  • 20.2 A running example
  • 20.3 Message passing interface basics
  • 20.4 Message passing interface point-to-point communication
  • 20.5 Overlapping computation and communication
  • 20.6 Message passing interface collective communication
  • 20.7 CUDA-aware message passing interface
  • 20.8 Summary
  • Exercises
  • References
  • Chapter 21. CUDA dynamic parallelism
  • Abstract
  • Chapter Outline
  • 21.1 Background
  • 21.2 Dynamic parallelism overview
  • 21.3 An example: Bezier curves
  • 21.4 A recursive example: quadtrees
  • 21.5 Important considerations
  • 21.6 Summary
  • Exercises
  • A21.1 Support code for quadtree example
  • References
  • Chapter 22. Advanced practices and future evolution
  • Abstract
  • Chapter Outline
  • 22.1 Model of host/device interaction
  • 22.2 Kernel execution control
  • 22.3 Memory bandwidth and compute throughput
  • 22.4 Programming environment
  • 22.5 Future outlook
  • References
  • Chapter 23. Conclusion and outlook
  • Abstract
  • Chapter Outline
  • 23.1 Goals revisited
  • 23.2 Future outlook
  • Appendix A. Numerical considerations
  • A.1 Floating-point data representation
  • A.2 Representable numbers
  • A.3 Special bit patterns and precision in IEEE format
  • A.4 Arithmetic accuracy and rounding
  • A.5 Algorithm considerations
  • A.6 Linear solvers and numerical stability
  • A.7 Summary
  • Exercises
  • Index

Product details

  • No. of pages: 580
  • Language: English
  • Copyright: © Morgan Kaufmann 2022
  • Published: May 28, 2022
  • Imprint: Morgan Kaufmann
  • Paperback ISBN: 9780323912310
  • eBook ISBN: 9780323984638

About the Authors

Wen-mei Hwu

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu), and a co-founder and CTO of MulticoreWare. For his contributions to research and teaching, he has received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters petascale computing project. Dr. Hwu received his Ph.D. in Computer Science from the University of California, Berkeley.

Affiliations and Expertise

CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA

David Kirk

David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video game company, as chief scientist and head of technology. In 1997 he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow.

At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.

Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.

Affiliations and Expertise

NVIDIA Fellow

Izzat El Hajj

Izzat El Hajj is an Assistant Professor in the Department of Computer Science at the American University of Beirut. His research interests are in application acceleration and programming support for emerging parallel processors and memory technologies, with a particular interest in GPUs and processing-in-memory. He received his Ph.D. in Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He is a recipient of the Dan Vivoli Endowed Fellowship at the University of Illinois at Urbana-Champaign, and the Distinguished Graduate Award at the American University of Beirut.

Affiliations and Expertise

Assistant Professor, Department of Computer Science, American University of Beirut, Lebanon
