GPU Computing Gems Emerald Edition - 1st Edition - ISBN: 9780123849885, 9780123849892

GPU Computing Gems Emerald Edition

1st Edition

Editor-in-Chief: Wen-mei Hwu
Hardcover ISBN: 9780123849885
eBook ISBN: 9780123849892
Imprint: Morgan Kaufmann
Published Date: 24th January 2011
Page Count: 886

Table of Contents

Editors, Reviewers, and Authors



Chapter 1. GPU-Accelerated Computation and Interactive Display of Molecular Orbitals

1.1. Introduction, Problem Statement, and Context

1.2. Core Method

1.3. Algorithms, Implementations, and Evaluations

1.4. Final Evaluation

1.5. Future Directions

Chapter 2. Large-Scale Chemical Informatics on GPUs

2.1. Introduction, Problem Statement, and Context

2.2. Core Methods

2.3. Gaussian Shape Overlay: Parallelization and Arithmetic Optimization

2.4. LINGO: Algorithmic Transformation and Memory Optimization

2.5. Final Evaluation

2.6. Future Directions

Chapter 3. Dynamical Quadrature Grids

3.1. Introduction

3.2. Core Method

3.3. Implementation

3.4. Performance Improvement

3.5. Future Work

Chapter 4. Fast Molecular Electrostatics Algorithms on GPUs

4.1. Introduction, Problem Statement, and Context

4.2. Core Method

4.3. Algorithms, Implementations, and Evaluations

4.4. Final Evaluation

4.5. Future Directions

Chapter 5. Quantum Chemistry

5.1. Problem Statement

5.2. Core Technology and Algorithm

5.3. The Key Insight on the Implementation—the Choice of Building Blocks

5.4. Final Evaluation and Benefits

5.5. Conclusions and Future Directions

Chapter 6. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm

6.1. Introduction, Problem Statement, and Context

6.2. Core Methods

6.3. Algorithms and Implementations

6.4. Evaluation and Validation of Results, Total Benefits, and Limitations

6.5. Future Directions

Chapter 7. Leveraging the Untapped Computation Power of GPUs

7.1. Background and Problem Statement

7.2. Flux Calculation and Aggregation

7.3. The GRASSY Platform

7.4. Initial Testing

7.5. Impact and Future Directions

Chapter 8. Black Hole Simulations with CUDA

8.1. Introduction

8.2. The Post-Newtonian Approximation

8.3. Numerical Algorithm

8.4. GPU Implementation

8.5. Performance Results

8.6. GPU Supercomputing Clusters

8.7. Statistical Results for Black Hole Inspirals

8.8. Conclusion

Chapter 9. Treecode and Fast Multipole Method for N-Body Simulation with CUDA

9.1. Introduction

9.2. Fast N-Body Simulation

9.3. CUDA Implementation of the Fast N-Body Algorithms

9.4. Improvements of Performance

9.5. Detailed Description of the GPU Kernels

9.6. Overview of Advanced Techniques

9.7. Conclusions

Chapter 10. Wavelet-Based Density Functional Theory Calculation on Massively Parallel Hybrid Architectures

10.1. Introduction, Problem Statement, and Context

10.2. Core Method

10.3. Algorithms, Implementations, and Evaluations

10.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

10.5. Conclusions and Future Directions


Chapter 11. Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm

11.1. Introduction, Problem Statement, and Context

11.2. Core Method

11.3. CUDA Implementation of the SW Algorithm for Identification of Homologous Proteins

11.4. Discussion

11.5. Final Evaluation

Chapter 12. Massive Parallel Computing to Accelerate Genome-Matching

12.1. Introduction, Problem Statement, and Context

12.2. Core Methods

12.3. Algorithms, Implementations, and Evaluations

12.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

12.5. Future Directions

Chapter 13. GPU-Supercomputer Acceleration of Pattern Matching

13.1. Introduction, Problem Statement, and Context

13.2. Core Method

13.3. Algorithms, Implementations, and Evaluations

13.4. Final Evaluation

13.5. Future Direction

Chapter 14. GPU Accelerated RNA Folding Algorithm

14.1. Problem Statement

14.2. Core Method

14.3. Algorithms, Implementations, and Evaluations

14.4. Final Evaluation

14.5. Future Directions

Chapter 15. Temporal Data Mining for Neuroscience

15.1. Introduction

15.2. Core Methodology

15.3. GPU Parallelization: Algorithms and Implementations

15.4. Experimental Results

15.5. Discussion


Chapter 16. Parallelization Techniques for Random Number Generators

16.1. Introduction

16.2. L'Ecuyer's Multiple Recursive Generator MRG32k3a

16.3. Sobol Generator

16.4. Mersenne Twister MT19937

16.5. Performance Benchmarks

Chapter 17. Monte Carlo Photon Transport on the GPU

17.1. Physics of Photon Transport

17.2. Photon Transport on the GPU

17.3. The Complete System

17.4. Results and Evaluation

17.5. Future Directions

Chapter 18. High-Performance Iterated Function Systems

18.1. Problem Statement and Mathematical Background

18.2. Core Technology

18.3. Implementation

18.4. Final Evaluation

18.5. Conclusion


Chapter 19. Large-Scale Machine Learning

19.1. Introduction

19.2. Core Technology

19.3. GPU Algorithm and Implementation

19.4. Improvements of Performance

19.5. Conclusions and Future Work

Chapter 20. Multiclass Support Vector Machine

20.1. Introduction, Problem Statement, and Context

20.2. Core Method

20.3. Algorithms, Implementations, and Evaluations

20.4. Final Evaluation

20.5. Future Direction

Chapter 21. Template-Driven Agent-Based Modeling and Simulation with CUDA

21.1. Introduction, Problem Statement, and Context

21.2. Final Evaluation and Validation of Results

21.3. Conclusions, Benefits and Limitations, and Future Work

Chapter 22. GPU-Accelerated Ant Colony Optimization

22.1. Introduction, Problem Statement, and Context

22.2. Core Method

22.3. Algorithms, Implementations, and Evaluations

22.4. Final Evaluation

22.5. Future Direction


Chapter 23. High-Performance Gate-Level Simulation with GP-GPUs

23.1. Introduction

23.2. Simulator Overview

23.3. Compilation and Simulation

23.4. Experimental Results

23.5. Future Directions

Chapter 24. GPU-Based Parallel Computing for Fast Circuit Optimization

24.1. Introduction, Problem Statement, and Context

24.2. Core Method

24.3. Algorithms, Implementations, and Evaluations

24.4. Final Evaluation

24.5. Future Direction


Chapter 25. Lattice Boltzmann Lighting Models

25.1. Introduction, Problem Statement, and Context

25.2. Core Methods

25.3. Algorithms, Implementation, and Evaluation

25.4. Final Evaluation

25.5. Future Directions

25.6. Derivation of the Diffusion Equation

Chapter 26. Path Regeneration for Random Walks

26.1. Introduction

26.2. Path Tracing as Case Study

26.3. Random Walks in Path Tracing

26.4. Implementation Details

26.5. Results

26.6. Discussion

Chapter 27. From Sparse Mocap to Highly Detailed Facial Animation

27.1. System Overview

27.2. Background

27.3. Core Technology and Algorithms

27.4. Future Directions

Chapter 28. A Programmable Graphics Pipeline in CUDA for Order-Independent Transparency

28.1. Introduction, Problem Statement, and Context

28.2. Core Method

28.3. Algorithms, Implementations, and Evaluations

28.4. Final Evaluation

28.5. Future Direction


Chapter 29. Fast Graph Cuts for Computer Vision

29.1. Introduction, Problem Statement, and Context

29.2. Core Method

29.3. Algorithms, Implementations, and Evaluations

29.4. Final Evaluation and Validation of Results

29.5. Multilabel Graph Cuts

Chapter 30. Visual Saliency Model on Multi-GPU

30.1. Introduction

30.2. Visual Saliency Model

30.3. GPU Implementation

30.4. Results

30.5. Conclusion

Chapter 31. Real-Time Stereo on GPGPU Using Progressive Multiresolution Adaptive Windows

31.1. Introduction, Problem Statement, and Context

31.2. Core Method

Chapter 32. Real-Time Speed-Limit-Sign Recognition on an Embedded System Using a GPU

32.1. Introduction

32.2. Methods

32.3. Implementation

32.4. Results and Discussion

32.5. Conclusion and Future Work

Chapter 33. Haar Classifiers for Object Detection with CUDA

33.1. Introduction

33.2. Viola-Jones Object Detection Retrospective

33.3. Object Detection Pipeline with NVIDIA CUDA

33.4. Benchmarking and Implementation Details

33.5. Future Direction

33.6. Conclusion


Chapter 34. Experiences on Image and Video Processing with CUDA and OpenCL

34.1. Introduction, Problem Statement, and Background

34.2. Core Technology or Algorithm

34.3. Key Insights from Implementation and Evaluation

34.4. Final Evaluation

Chapter 35. Connected Component Labeling in CUDA

35.1. Introduction

35.2. Core Algorithm

35.3. CUDA Algorithm and Implementation

35.4. Final Evaluation and Results

Chapter 36. Image De-Mosaicing

36.1. Introduction, Problem Statement, and Context

36.2. Core Method

36.3. Algorithms, Implementations, and Evaluations

36.4. Final Evaluation


Chapter 37. Efficient Automatic Speech Recognition on the GPU

37.1. Introduction, Problem Statement, and Context

37.2. Core Methods

37.3. Algorithms, Implementations, and Evaluations

37.4. Conclusion and Future Directions

Chapter 38. Parallel LDPC Decoding

38.1. Introduction, Problem Statement, and Context

38.2. Core Technology

38.3. Algorithms, Implementations, and Evaluations

38.4. Final Evaluation

38.5. Future Directions

Chapter 39. Large-Scale Fast Fourier Transform

39.1. Introduction

39.2. Memory Hierarchy of GPU Clusters

39.3. Large-Scale Fast Fourier Transform

39.4. Algebraic Manipulation of Array Dimensions

39.5. Performance Results

39.6. Conclusion and Future Work


Chapter 40. GPU Acceleration of Iterative Digital Breast Tomosynthesis

40.1. Introduction

40.2. Digital Breast Tomosynthesis

40.3. Accelerating Iterative DBT Using GPUs

40.4. Conclusions

Chapter 41. Parallelization of Katsevich CT Image Reconstruction Algorithm on Generic Multi-Core Processors and GPGPU

41.1. Introduction, Problem, and Context

41.2. Core Methods

41.3. Algorithms, Implementations, and Evaluations

41.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

41.5. Related Work

41.6. Future Directions

41.7. Summary

Chapter 42. 3-D Tomographic Image Reconstruction from Randomly Ordered Lines with CUDA

42.1. Introduction

42.2. Core Methods

42.3. Implementation

42.4. Evaluation and Validation of Results, Total Benefits, and Limitations

42.5. Future Directions

Chapter 43. Using GPUs to Learn Effective Parameter Settings for GPU-Accelerated Iterative CT Reconstruction Algorithms

43.1. Introduction, Problem Statement, and Context

43.2. Core Method(s)

43.3. Algorithms, Implementations, and Evaluations

43.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

43.5. Future Directions

Chapter 44. Using GPUs to Accelerate Advanced MRI Reconstruction with Field Inhomogeneity Compensation

44.1. Introduction

44.2. Core Method: Advanced Image Reconstruction Toolbox for MRI

44.3. MRI Reconstruction Algorithms and Implementation on GPUs

44.4. Final Results and Evaluation

44.5. Conclusion and Future Directions

Chapter 45. ℓ1 Minimization in ℓ1-SPIRiT Compressed Sensing MRI Reconstruction

45.1. Introduction, Problem Statement, and Context

45.2. Core Methods (High Level Description)

45.3. Algorithms, Implementations, and Evaluations (Detailed Description)

45.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

45.5. Discussion and Conclusion

Chapter 46. Medical Image Processing Using GPU-Accelerated ITK Image Filters

46.1. Introduction

46.2. Core Methods

46.3. Implementation

46.4. Results

46.5. Future Directions

46.6. Acknowledgments

Chapter 47. Deformable Volumetric Registration Using B-Splines

47.1. Introduction

47.2. An Overview of B-Spline Registration

47.3. Implementation Details

47.4. Results

47.5. Conclusions

Chapter 48. Multiscale Unbiased Diffeomorphic Atlas Construction on Multi-GPUs

48.1. Introduction, Problem Statement, and Context

48.2. Core Methods

48.3. Algorithms, Implementations, and Evaluations

48.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

48.5. Future Directions

Chapter 49. GPU-Accelerated Brain Connectivity Reconstruction and Visualization in Large-Scale Electron Micrographs

49.1. Introduction

49.2. Core Methods

49.3. Implementation

49.4. Results

49.5. Future Directions

Chapter 50. Fast Simulation of Radiographic Images Using a Monte Carlo X-Ray Transport Algorithm Implemented in CUDA

50.1. Introduction, Problem Statement, and Context

50.2. Core Methods

50.3. Algorithms, Implementations, and Evaluations

50.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

50.5. Future Directions



GPU Computing Gems Emerald Edition offers practical techniques in parallel computing using graphics processing units (GPUs) to enhance scientific research. The first volume in Morgan Kaufmann's Applications of GPU Computing Series, this book presents the latest insights and research in computer vision, electronic design automation, and emerging data-intensive applications. It also covers life sciences, medical imaging, ray tracing and rendering, scientific simulation, signal and audio processing, statistical modeling, and video and image processing.

This book is intended to help those who face the challenge of programming systems to use GPUs effectively and meet efficiency and performance goals. It offers developers a window into diverse application areas, and the opportunity to gain insights from others' algorithm work that they may apply to their own projects. Readers will learn from the leading researchers in parallel programming, who have gathered their solutions and experience in one volume under the guidance of expert area editors. Each chapter is written to be accessible to researchers from other domains, allowing knowledge to cross-pollinate across the GPU spectrum. Many examples leverage NVIDIA's CUDA parallel computing architecture, the most widely adopted massively parallel programming solution. The insights and ideas as well as practical hands-on skills in the book can be immediately put to use.

Computer programmers, software engineers, hardware engineers, and computer science students will find this volume a helpful resource.

Key Features

  • Covers the breadth of industry from scientific simulation and electronic design automation to audio/video processing, medical imaging, computer vision, and more
  • Many examples leverage NVIDIA's CUDA parallel computing architecture, the most widely adopted massively parallel programming solution
  • Offers insights and ideas as well as practical "hands-on" skills you can immediately put to use






Praise for GPU Computing Gems: Emerald Edition:
"GPU computing is becoming an outstanding field in high performance computing. Due to its easiness, the CUDA approach enables programmers to take advantage of GPU-acceleration very quickly… My research in complex science as well as applications in high frequency trading benefited significantly from GPU computing." --Dr. Tobias Preis, ETH Zurich, Switzerland

"This book is an important reference for everyone working on GPU/CUDA, and contains definitive work in a selection of fields. The patterns of CUDA parallelization it describes can often be adapted to applications in other fields." --Dr. Ming Ouyang, Assistant Professor – Director Visualization and Intensive Graphics Lab, University of Louisville

"Diving into the world of GPU computing has never been more important these days. GPU Computing Gems: Emerald Edition takes you through the looking glass into this fascinating world." --Martin Eisemann, Computer Graphics Lab, TU Braunschweig

"…an outstanding collection of vignettes of how to program GPUs for a breathtaking range of applications." --Dr. Amitabh Varshney, Director, Institute for Advanced Computer Studies, University of Maryland

"The book features a useful index that might help readers mine the gems in search of a solution to a specific algorithmic problem. The index is accompanied by online resources containing source code samples—and further information—for some of the chapters. A second volume with another 30 chapters of GPGPU application reports, somewhat more focused on generic algorithms and programming techniques, is currently in the pipeline and scheduled to appear as the "Jade Edition" sometime this month." --Computing in Science and Engineering

"The book is an excellent selection of important papers describing various applications of GPUs. As such, I believe it would be a valuable addition to the bookshelf of any researcher in modeling and simulation…This is not a substitute for a more detailed text on massively parallel programming...Instead, it is a nice practical addition to that text." --Computing Reviews, August 2012

"...the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." --Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010

About the Editor-in-Chief

Wen-mei Hwu Editor-in-Chief

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the area of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group. He is a co-founder and CTO of MulticoreWare. For his contributions in research and teaching, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters Petascale computer project. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.

Affiliations and Expertise

CTO, MulticoreWare; Professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign