Programming Massively Parallel Processors
A Hands-on Approach
By:
- David Kirk, NVIDIA Fellow
- Wen-mei Hwu, Professor, University of Illinois
This best-selling guide to CUDA and GPU parallel programming has been revised with more parallel programming examples, coverage of commonly used libraries, and explanations of the latest tools. With these improvements, the book retains its concise, intuitive, practical approach, based on years of road-testing in the authors' own parallel computing courses.
Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.
Audience
Advanced students, software engineers, programmers, and hardware engineers
Paperback, 514 pages
Published: December 2012
Imprint: Morgan Kaufmann
ISBN: 978-0-12-415992-1
Reviews
"For those interested in the GPU path to parallel enlightenment, this new book from David Kirk and Wen-mei Hwu is a godsend, as it introduces CUDA(tm), a C-like data parallel language, and Tesla(tm), the architecture of the current generation of NVIDIA GPUs. In addition to explaining the language and the architecture, they define the nature of data parallel problems that run well on the heterogeneous CPU-GPU hardware ... This book is a valuable addition to the recently reinvigorated parallel computing literature."
- David Patterson, Director of The Parallel Computing Research Laboratory and the Pardee Professor of Computer Science, U.C. Berkeley; co-author of Computer Architecture: A Quantitative Approach

"Written by two teaching pioneers, this book is the definitive practical reference on programming massively parallel processors--a true technological gold mine. The hands-on learning included is cutting-edge, yet very readable. This is a most rewarding read for students, engineers, and scientists interested in supercharging computational resources to solve today's and tomorrow's hardest problems."
- Nicolas Pinto, MIT, NVIDIA Fellow, 2009

"I have always admired Wen-mei Hwu's and David Kirk's ability to turn complex problems into easy-to-comprehend concepts. They have done it again in this book. This joint venture of a passionate teacher and a GPU evangelizer tackles the trade-off between the simple explanation of the concepts and the in-depth analysis of the programming techniques. This is a great book to learn both massive parallel programming and CUDA."
- Mateo Valero, Director, Barcelona Supercomputing Center

"The use of GPUs is having a big impact in scientific computing. David Kirk and Wen-mei Hwu's new book is an important contribution towards educating our students on the ideas and techniques of programming for massively parallel processors."
- Mike Giles, Professor of Scientific Computing, University of Oxford

"This book is the most comprehensive and authoritative introduction to GPU computing yet. David Kirk and Wen-mei Hwu are the pioneers in this increasingly important field, and their insights are invaluable and fascinating. This book will be the standard reference for years to come."
- Hanspeter Pfister, Harvard University

"This is a vital and much-needed text. GPU programming is growing by leaps and bounds. This new book will be very welcomed and highly useful across inter-disciplinary fields."
- Shannon Steinfadt, Kent State University

"GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors."
- CNNMoney.com
Contents
Chapter 1: Introduction
1.1 GPUs as Parallel Computers
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Parallel Programming Languages and Models
1.5 Overarching Goals
1.6 Organization of the Book
Chapter 2: History of GPU Computing
2.1 Evolution of Graphics Pipelines
  The Era of Fixed-Function Graphics Pipelines
  Evolution of Programmable Real-Time Graphics
  Unified Graphics and Computing Processors
2.2 GPGPU: An Intermediate Step
  Scalable GPUs
  Recent Developments
  Future Trends
Chapter 3: Introduction to CUDA
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Matrix-Matrix Multiplication Example
3.4 Device Memories and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
  Function Declarations
  Kernel Launch
  Predefined Variables
  Runtime API
Chapter 4: CUDA Threads
4.1 CUDA Thread Organization
4.2 More on blockIdx and threadIdx
4.3 Synchronization and Transparent Scalability
4.4 Thread Assignment
4.5 Thread Scheduling and Latency Tolerance
4.6 Summary
Chapter 5: CUDA Memories
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 Memory as a Limiting Factor to Parallelism
5.5 Summary
Chapter 6: Performance Considerations
6.1 More on Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of SM Resources
6.4 Data Prefetching
6.5 Instruction Mix
6.6 Thread Granularity
6.7 Measured Performance and Summary
Chapter 7: Floating-Point Considerations
7.1 Floating-Point Format
  Normalized Representation of M
  Excess Encoding of E
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Summary
Chapter 8: Application Case Study I: Advanced MRI Reconstruction
8.1 Application Background
8.2 Iterative Reconstruction
8.3 Computing FHd
  Step 1: Determine the Kernel Parallelism Structure
  Step 2: Getting Around the Memory Bandwidth Limitation
  Step 3: Use Hardware Trigonometry Functions
  Step 4: Experimental Performance Testing
8.4 Final Evaluation
Chapter 9: Application Case Study II: Molecular Visualization and Analysis
9.1 Application Background
9.2 A Simple Kernel Implementation
9.3 Instruction Execution Efficiency
9.4 Memory Coalescing
9.5 Additional Performance Comparisons
9.6 Using Multiple GPUs
Chapter 10: Parallel Programming and Computational Thinking
10.1 Goals of Parallel Programming
10.2 Problem Decomposition
10.3 Algorithm Selection
10.4 Computational Thinking
Chapter 11: A Brief Introduction to OpenCL
11.1 Background
11.2 Data Parallelism Model
11.3 Device Architecture
11.4 Kernel Functions
11.5 Device Management and Kernel Launch
11.6 Electrostatic Potential Map in OpenCL
11.7 Summary
Chapter 12: Conclusion and Future Outlook
12.1 Goals Revisited
12.2 Memory Architecture Evolution
12.3 Kernel Execution Control Evolution
12.4 Core Performance
12.5 Programming Environment
12.6 A Bright Outlook
Appendix A: Matrix Multiplication Example Code
Appendix B: Speed and Feed of Current-Generation CUDA Devices
