Programming Massively Parallel Processors
A Hands-on Approach
4th Edition - May 28, 2022
Authors: Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
Language: English
Paperback ISBN: 9780323912310
eBook ISBN: 9780323984638
Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Concise, intuitive, and practical, it is based on years of road-testing in the authors' own parallel computing courses. Various techniques for constructing and optimizing parallel programs are explored in detail, while case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. The new edition includes updated coverage of CUDA, including newer libraries such as cuDNN. New chapters on frequently used parallel patterns have been added, and case studies have been updated to reflect current industry practices.
Parallel Patterns: Introduces new chapters on frequently used parallel patterns (stencil, reduction, sorting) and major improvements to previous chapters (convolution, histogram, sparse matrices, graph traversal, deep learning)
Ampere: Includes a new chapter focused on GPU architecture and draws examples from recent architecture generations, including Ampere
Systematic Approach: Incorporates major improvements to abstract discussions of problem decomposition strategies and performance considerations, with a new optimization checklist
Upper-level undergraduate through graduate students studying parallel computing in computer science or engineering
Cover image
Title page
Table of Contents
Copyright
Dedication
Foreword
Preface
How to use the book
A two-phased approach
Tying it all together: the final project
The design document
The project report and symposium
Class competition
Course resources
Acknowledgments
Chapter 1. Introduction
Abstract
Chapter Outline
1.1 Heterogeneous parallel computing
1.2 Why more speed or parallelism?
1.3 Speeding up real applications
1.4 Challenges in parallel programming
1.5 Related parallel programming interfaces
1.6 Overarching goals
1.7 Organization of the book
References
Part I: Fundamental Concepts
Chapter 2. Heterogeneous data parallel computing
Abstract
Chapter Outline
2.1 Data parallelism
2.2 CUDA C program structure
2.3 A vector addition kernel
2.4 Device global memory and data transfer
2.5 Kernel functions and threading
2.6 Calling kernel functions
2.7 Compilation
2.8 Summary
Exercises
References
Chapter 3. Multidimensional grids and data
Abstract
Chapter Outline
3.1 Multidimensional grid organization
3.2 Mapping threads to multidimensional data
3.3 Image blur: a more complex kernel
3.4 Matrix multiplication
3.5 Summary
Exercises
Chapter 4. Compute architecture and scheduling
Abstract
Chapter Outline
4.1 Architecture of a modern GPU
4.2 Block scheduling
4.3 Synchronization and transparent scalability
4.4 Warps and SIMD hardware
4.5 Control divergence
4.6 Warp scheduling and latency tolerance
4.7 Resource partitioning and occupancy
4.8 Querying device properties
4.9 Summary
Exercises
References
Chapter 5. Memory architecture and data locality
Abstract
Chapter Outline
5.1 Importance of memory access efficiency
5.2 CUDA memory types
5.3 Tiling for reduced memory traffic
5.4 A tiled matrix multiplication kernel
5.5 Boundary checks
5.6 Impact of memory usage on occupancy
5.7 Summary
Exercises
Chapter 6. Performance considerations
Abstract
Chapter Outline
6.1 Memory coalescing
6.2 Hiding memory latency
6.3 Thread coarsening
6.4 A checklist of optimizations
6.5 Knowing your computation’s bottleneck
6.6 Summary
Exercises
References
Part II: Parallel Patterns
Chapter 7. Convolution: An introduction to constant memory and caching
Abstract
Chapter Outline
7.1 Background
7.2 Parallel convolution: a basic algorithm
7.3 Constant memory and caching
7.4 Tiled convolution with halo cells
7.5 Tiled convolution using caches for halo cells
7.6 Summary
Exercises
Chapter 8. Stencil
Abstract
Chapter Outline
8.1 Background
8.2 Parallel stencil: a basic algorithm
8.3 Shared memory tiling for stencil sweep
8.4 Thread coarsening
8.5 Register tiling
8.6 Summary
Exercises
Chapter 9. Parallel histogram: An introduction to atomic operations and privatization
Abstract
Chapter Outline
9.1 Background
9.2 Atomic operations and a basic histogram kernel
9.3 Latency and throughput of atomic operations
9.4 Privatization
9.5 Coarsening
9.6 Aggregation
9.7 Summary
Exercises
References
Chapter 10. Reduction: And minimizing divergence
Abstract
Chapter Outline
10.1 Background
10.2 Reduction trees
10.3 A simple reduction kernel
10.4 Minimizing control divergence
10.5 Minimizing memory divergence
10.6 Minimizing global memory accesses
10.7 Hierarchical reduction for arbitrary input length
10.8 Thread coarsening for reduced overhead
10.9 Summary
Exercises
Chapter 11. Prefix sum (scan): An introduction to work efficiency in parallel algorithms
Abstract
Chapter Outline
11.1 Background
11.2 Parallel scan with the Kogge-Stone algorithm
11.3 Speed and work efficiency consideration
11.4 Parallel scan with the Brent-Kung algorithm
11.5 Coarsening for even more work efficiency
11.6 Segmented parallel scan for arbitrary-length inputs
11.7 Single-pass scan for memory access efficiency
11.8 Summary
Exercises
References
Chapter 12. Merge: An introduction to dynamic input data identification
Abstract
Chapter Outline
12.1 Background
12.2 A sequential merge algorithm
12.3 A parallelization approach
12.4 Co-rank function implementation
12.5 A basic parallel merge kernel
12.6 A tiled merge kernel to improve coalescing
12.7 A circular buffer merge kernel
12.8 Thread coarsening for merge
12.9 Summary
Exercises
References
Part III: Advanced Patterns and Applications
Chapter 13. Sorting
Abstract
Chapter Outline
13.1 Background
13.2 Radix sort
13.3 Parallel radix sort
13.4 Optimizing for memory coalescing
13.5 Choice of radix value
13.6 Thread coarsening to improve coalescing
13.7 Parallel merge sort
13.8 Other parallel sort methods
13.9 Summary
Exercises
References
Chapter 14. Sparse matrix computation
Abstract
Chapter Outline
14.1 Background
14.2 A simple SpMV kernel with the COO format
14.3 Grouping row nonzeros with the CSR format
14.4 Improving memory coalescing with the ELL format
14.5 Regulating padding with the hybrid ELL-COO format
14.6 Reducing control divergence with the JDS format
14.7 Summary
Exercises
References
Chapter 15. Graph traversal
Abstract
Chapter Outline
15.1 Background
15.2 Breadth-first search
15.3 Vertex-centric parallelization of breadth-first search
15.4 Edge-centric parallelization of breadth-first search
15.5 Improving efficiency with frontiers
15.6 Reducing contention with privatization
15.7 Other optimizations
15.8 Summary
Exercises
References
Chapter 16. Deep learning
Abstract
Chapter Outline
16.1 Background
16.2 Convolutional neural networks
16.3 Convolutional layer: a CUDA inference kernel
16.4 Formulating a convolutional layer as GEMM
16.5 CUDNN library
16.6 Summary
Exercises
References
Chapter 17. Iterative magnetic resonance imaging reconstruction
Abstract
Chapter Outline
17.1 Background
17.2 Iterative reconstruction
17.3 Computing FHD
17.4 Summary
Exercises
References
Chapter 18. Electrostatic potential map
Abstract
Chapter Outline
18.1 Background
18.2 Scatter versus gather in kernel design
18.3 Thread coarsening
18.4 Memory coalescing
18.5 Cutoff binning for data size scalability
18.6 Summary
Exercises
References
Chapter 19. Parallel programming and computational thinking
Abstract
Chapter Outline
19.1 Goals of parallel computing
19.2 Algorithm selection
19.3 Problem decomposition
19.4 Computational thinking
19.5 Summary
References
Part IV: Advanced Practices
Chapter 20. Programming a heterogeneous computing cluster: An introduction to CUDA streams
Abstract
Chapter Outline
20.1 Background
20.2 A running example
20.3 Message passing interface basics
20.4 Message passing interface point-to-point communication
20.5 Overlapping computation and communication
20.6 Message passing interface collective communication
20.7 CUDA aware message passing interface
20.8 Summary
Exercises
References
Chapter 21. CUDA dynamic parallelism
Abstract
Chapter Outline
21.1 Background
21.2 Dynamic parallelism overview
21.3 An example: Bezier curves
21.4 A recursive example: quadtrees
21.5 Important considerations
21.6 Summary
Exercises
A21.1 Support code for quadtree example
References
Chapter 22. Advanced practices and future evolution
Abstract
Chapter Outline
22.1 Model of host/device interaction
22.2 Kernel execution control
22.3 Memory bandwidth and compute throughput
22.4 Programming environment
22.5 Future outlook
References
Chapter 23. Conclusion and outlook
Abstract
Chapter Outline
23.1 Goals revisited
23.2 Future outlook
Appendix A. Numerical considerations
A.1 Floating-point data representation
A.2 Representable numbers
A.3 Special bit patterns and precision in IEEE format
A.4 Arithmetic accuracy and rounding
A.5 Algorithm considerations
A.6 Linear solvers and numerical stability
A.7 Summary
Exercises
Index
No. of pages: 580
Imprint: Morgan Kaufmann
Wen-mei W. Hwu
Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu), and a co-founder and CTO of MulticoreWare. For his contributions in research and teaching, he received the ACM SigArch Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters Petascale computer project. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.
Affiliations and expertise
CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA
David B. Kirk
David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow.
At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.
Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.
Affiliations and expertise
NVIDIA Fellow
Izzat El Hajj
Izzat El Hajj is an Assistant Professor in the Department of Computer Science at the American University of Beirut. His research interests are in application acceleration and programming support for emerging parallel processors and memory technologies, with a particular interest in GPUs and processing-in-memory. He received his Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign. He is a recipient of the Dan Vivoli Endowed Fellowship at the University of Illinois at Urbana-Champaign and the Distinguished Graduate Award at the American University of Beirut.
Affiliations and expertise
Assistant Professor, Department of Computer Science, American University of Beirut, Lebanon