GPU Computing Gems Emerald Edition

GPU Computing Gems Emerald Edition offers practical techniques in parallel computing using graphics processing units (GPUs) to enhance scientific research. The first volume in Morgan Kaufmann's Applications of GPU Computing Series, this book presents the latest insights and research in computer vision, electronic design automation, and emerging data-intensive applications. It also covers life sciences, medical imaging, ray tracing and rendering, scientific simulation, signal and audio processing, statistical modeling, and video and image processing.

This book is intended to help those who face the challenge of programming systems to use GPUs effectively to achieve efficiency and performance goals. It offers developers a window into diverse application areas and the opportunity to gain insights from others' algorithm work that they may apply to their own projects. Readers will learn from the leading researchers in parallel programming, who have gathered their solutions and experience in one volume under the guidance of expert area editors. Each chapter is written to be accessible to researchers from other domains, allowing knowledge to cross-pollinate across the GPU spectrum. Many examples leverage NVIDIA's CUDA parallel computing architecture, the most widely adopted massively parallel programming solution. The insights, ideas, and practical hands-on skills in the book can be put to use immediately.

Computer programmers, software engineers, hardware engineers, and computer science students will find this volume a helpful resource. For useful source code discussed throughout the book, the editors invite readers to the following website: http://gpugems.hwu-server2.crhc.illinois.edu …

Editors, Reviewers, and Authors

Introduction

Chapter 1. GPU-Accelerated Computation and Interactive Display of Molecular Orbitals

1.1. Introduction, Problem Statement, and Context

1.2. Core Method

1.3. Algorithms, Implementations, and Evaluations

1.4. Final Evaluation

1.5. Future Directions

Chapter 2. Large-Scale Chemical Informatics on GPUs

2.1. Introduction, Problem Statement, and Context

2.2. Core Methods

2.3. Gaussian Shape Overlay: Parallelization and Arithmetic Optimization

2.4. LINGO: Algorithmic Transformation and Memory Optimization

2.5. Final Evaluation

2.6. Future Directions

Chapter 3. Dynamical Quadrature Grids

3.1. Introduction

3.2. Core Method

3.3. Implementation

3.4. Performance Improvement

3.5. Future Work

Chapter 4. Fast Molecular Electrostatics Algorithms on GPUs

4.1. Introduction, Problem Statement, and Context

4.2. Core Method

4.3. Algorithms, Implementations, and Evaluations

4.4. Final Evaluation

4.5. Future Directions

Chapter 5. Quantum Chemistry

5.1. Problem Statement

5.2. Core Technology and Algorithm

5.3. The Key Insight on the Implementation—the Choice of Building Blocks

5.4. Final Evaluation and Benefits

5.5. Conclusions and Future Directions

Chapter 6. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm

6.1. Introduction, Problem Statement, and Context

6.2. Core Methods

6.3. Algorithms and Implementations

6.4. Evaluation and Validation of Results, Total Benefits, and Limitations

6.5. Future Directions

Chapter 7. Leveraging the Untapped Computation Power of GPUs

7.1. Background and Problem Statement

7.2. Flux Calculation and Aggregation

7.3. The GRASSY Platform

7.4. Initial Testing

7.5. Impact and Future Directions

Chapter 8. Black Hole Simulations with CUDA

8.1. Introduction

8.2. The Post-Newtonian Approximation

8.3. Numerical Algorithm

8.4. GPU Implementation

8.5. Performance Results

8.6. GPU Supercomputing Clusters

8.7. Statistical Results for Black Hole Inspirals

8.8. Conclusion

Chapter 9. Treecode and Fast Multipole Method for N-Body Simulation with CUDA

9.1. Introduction

9.2. Fast N-Body Simulation

9.3. CUDA Implementation of the Fast N-Body Algorithms

9.4. Improvements of Performance

9.5. Detailed Description of the GPU Kernels

9.6. Overview of Advanced Techniques

9.7. Conclusions

Chapter 10. Wavelet-Based Density Functional Theory Calculation on Massively Parallel Hybrid Architectures

10.1. Introduction, Problem Statement, and Context

10.2. Core Method

10.3. Algorithms, Implementations, and Evaluations

10.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

10.5. Conclusions and Future Directions

Introduction

Chapter 11. Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm

11.1. Introduction, Problem Statement, and Context

11.2. Core Method

11.3. CUDA Implementation of the SW Algorithm for Identification of Homologous Proteins

11.4. Discussion

11.5. Final Evaluation

Chapter 12. Massive Parallel Computing to Accelerate Genome-Matching

12.1. Introduction, Problem Statement, and Context

12.2. Core Methods

12.3. Algorithms, Implementations, and Evaluations

12.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

12.5. Future Directions

Chapter 13. GPU-Supercomputer Acceleration of Pattern Matching

13.1. Introduction, Problem Statement, and Context

13.2. Core Method

13.3. Algorithms, Implementations, and Evaluations

13.4. Final Evaluation

13.5. Future Direction

Chapter 14. GPU Accelerated RNA Folding Algorithm

14.1. Problem Statement

14.2. Core Method

14.3. Algorithms, Implementations, and Evaluations

14.4. Final Evaluation

14.5. Future Directions

Chapter 15. Temporal Data Mining for Neuroscience

15.1. Introduction

15.2. Core Methodology

15.3. GPU Parallelization: Algorithms and Implementations

15.4. Experimental Results

15.5. Discussion

Introduction

Chapter 16. Parallelization Techniques for Random Number Generators

16.1. Introduction

16.2. L'Ecuyer's Multiple Recursive Generator MRG32k3a

16.3. Sobol Generator

16.4. Mersenne Twister MT19937

16.5. Performance Benchmarks

Chapter 17. Monte Carlo Photon Transport on the GPU

17.1. Physics of Photon Transport

17.2. Photon Transport on the GPU

17.3. The Complete System

17.4. Results and Evaluation

17.5. Future Directions

Chapter 18. High-Performance Iterated Function Systems

18.1. Problem Statement and Mathematical Background

18.2. Core Technology

18.3. Implementation

18.4. Final Evaluation

18.5. Conclusion

Introduction

Chapter 19. Large-Scale Machine Learning

19.1. Introduction

19.2. Core Technology

19.3. GPU Algorithm and Implementation

19.4. Improvements of Performance

19.5. Conclusions and Future Work

Chapter 20. Multiclass Support Vector Machine

20.1. Introduction, Problem Statement, and Context

20.2. Core Method

20.3. Algorithms, Implementations, and Evaluations

20.4. Final Evaluation

20.5. Future Direction

Chapter 21. Template-Driven Agent-Based Modeling and Simulation with CUDA

21.1. Introduction, Problem Statement, and Context

21.2. Final Evaluation and Validation of Results

21.3. Conclusions, Benefits and Limitations, and Future Work

Chapter 22. GPU-Accelerated Ant Colony Optimization

22.1. Introduction, Problem Statement, and Context

22.2. Core Method

22.3. Algorithms, Implementations, and Evaluations

22.4. Final Evaluation

22.5. Future Direction

Introduction

Chapter 23. High-Performance Gate-Level Simulation with GP-GPUs

23.1. Introduction

23.2. Simulator Overview

23.3. Compilation and Simulation

23.4. Experimental Results

23.5. Future Directions

Chapter 24. GPU-Based Parallel Computing for Fast Circuit Optimization

24.1. Introduction, Problem Statement, and Context

24.2. Core Method

24.3. Algorithms, Implementations, and Evaluations

24.4. Final Evaluation

24.5. Future Direction

Introduction

Chapter 25. Lattice Boltzmann Lighting Models

25.1. Introduction, Problem Statement, and Context

25.2. Core Methods

25.3. Algorithms, Implementation, and Evaluation

25.4. Final Evaluation

25.5. Future Directions

25.6. Derivation of the Diffusion Equation

Chapter 26. Path Regeneration for Random Walks

26.1. Introduction

26.2. Path Tracing as Case Study

26.3. Random Walks in Path Tracing

26.4. Implementation Details

26.5. Results

26.6. Discussion

Chapter 27. From Sparse Mocap to Highly Detailed Facial Animation

27.1. System Overview

27.2. Background

27.3. Core Technology and Algorithms

27.4. Future Directions

Chapter 28. A Programmable Graphics Pipeline in CUDA for Order-Independent Transparency

28.1. Introduction, Problem Statement, and Context

28.2. Core Method

28.3. Algorithms, Implementations, and Evaluations

28.4. Final Evaluation

28.5. Future Direction

Introduction

Chapter 29. Fast Graph Cuts for Computer Vision

29.1. Introduction, Problem Statement, and Context

29.2. Core Method

29.3. Algorithms, Implementations, and Evaluations

29.4. Final Evaluation and Validation of Results

29.5. Multilabel Graph Cuts

Chapter 30. Visual Saliency Model on Multi-GPU

30.1. Introduction

30.2. Visual Saliency Model

30.3. GPU Implementation

30.4. Results

30.5. Conclusion

Chapter 31. Real-Time Stereo on GPGPU Using Progressive Multiresolution Adaptive Windows

31.1. Introduction, Problem Statement, and Context

31.2. Core Method

Chapter 32. Real-Time Speed-Limit-Sign Recognition on an Embedded System Using a GPU

32.1. Introduction

32.2. Methods

32.3. Implementation

32.4. Results and Discussion

32.5. Conclusion and Future Work

Chapter 33. Haar Classifiers for Object Detection with CUDA

33.1. Introduction

33.2. Viola-Jones Object Detection Retrospective

33.3. Object Detection Pipeline with NVIDIA CUDA

33.4. Benchmarking and Implementation Details

33.5. Future Direction

33.6. Conclusion

Introduction

Chapter 34. Experiences on Image and Video Processing with CUDA and OpenCL

34.1. Introduction, Problem Statement, and Background

34.2. Core Technology or Algorithm

34.3. Key Insights from Implementation and Evaluation

34.4. Final Evaluation

Chapter 35. Connected Component Labeling in CUDA

35.1. Introduction

35.2. Core Algorithm

35.3. CUDA Algorithm and Implementation

35.4. Final Evaluation and Results

Chapter 36. Image De-Mosaicing

36.1. Introduction, Problem Statement, and Context

36.2. Core Method

36.3. Algorithms, Implementations, and Evaluations

36.4. Final Evaluation

Introduction

Chapter 37. Efficient Automatic Speech Recognition on the GPU

37.1. Introduction, Problem Statement, and Context

37.2. Core Methods

37.3. Algorithms, Implementations, and Evaluations

37.4. Conclusion and Future Directions

Chapter 38. Parallel LDPC Decoding

38.1. Introduction, Problem Statement, and Context

38.2. Core Technology

38.3. Algorithms, Implementations, and Evaluations

38.4. Final Evaluation

38.5. Future Directions

Chapter 39. Large-Scale Fast Fourier Transform

39.1. Introduction

39.2. Memory Hierarchy of GPU Clusters

39.3. Large-Scale Fast Fourier Transform

39.4. Algebraic Manipulation of Array Dimensions

39.5. Performance Results

39.6. Conclusion and Future Work

Introduction

Chapter 40. GPU Acceleration of Iterative Digital Breast Tomosynthesis

40.1. Introduction

40.2. Digital Breast Tomosynthesis

40.3. Accelerating Iterative DBT Using GPUs

40.4. Conclusions

Chapter 41. Parallelization of Katsevich CT Image Reconstruction Algorithm on Generic Multi-Core Processors and GPGPU

41.1. Introduction, Problem, and Context

41.2. Core Methods

41.3. Algorithms, Implementations, and Evaluations

41.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

41.5. Related Work

41.6. Future Directions

41.7. Summary

Chapter 42. 3-D Tomographic Image Reconstruction from Randomly Ordered Lines with CUDA

42.1. Introduction

42.2. Core Methods

42.3. Implementation

42.4. Evaluation and Validation of Results, Total Benefits, and Limitations

42.5. Future Directions

Chapter 43. Using GPUs to Learn Effective Parameter Settings for GPU-Accelerated Iterative CT Reconstruction Algorithms

43.1. Introduction, Problem Statement, and Context

43.2. Core Method(s)

43.3. Algorithms, Implementations, and Evaluations

43.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

43.5. Future Directions

Chapter 44. Using GPUs to Accelerate Advanced MRI Reconstruction with Field Inhomogeneity Compensation

44.1. Introduction

44.2. Core Method: Advanced Image Reconstruction Toolbox for MRI

44.3. MRI Reconstruction Algorithms and Implementation on GPUs

44.4. Final Results and Evaluation

44.5. Conclusion and Future Directions

Chapter 45. ℓ1 Minimization in ℓ1-SPIRiT Compressed Sensing MRI Reconstruction

45.1. Introduction, Problem Statement, and Context

45.2. Core Methods (High Level Description)

45.3. Algorithms, Implementations, and Evaluations (Detailed Description)

45.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

45.5. Discussion and Conclusion

Chapter 46. Medical Image Processing Using GPU-Accelerated ITK Image Filters

46.1. Introduction

46.2. Core Methods

46.3. Implementation

46.4. Results

46.5. Future Directions

46.6. Acknowledgments

Chapter 47. Deformable Volumetric Registration Using B-Splines

47.1. Introduction

47.2. An Overview of B-Spline Registration

47.3. Implementation Details

47.4. Results

47.5. Conclusions

Chapter 48. Multiscale Unbiased Diffeomorphic Atlas Construction on Multi-GPUs

48.1. Introduction, Problem Statement, and Context

48.2. Core Methods

48.3. Algorithms, Implementations, and Evaluations

48.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

48.5. Future Directions

Chapter 49. GPU-Accelerated Brain Connectivity Reconstruction and Visualization in Large-Scale Electron Micrographs

49.1. Introduction

49.2. Core Methods

49.3. Implementation

49.4. Results

49.5. Future Directions

Chapter 50. Fast Simulation of Radiographic Images Using a Monte Carlo X-Ray Transport Algorithm Implemented in CUDA

50.1. Introduction, Problem Statement, and Context

50.2. Core Methods

50.3. Algorithms, Implementations, and Evaluations

50.4. Final Evaluation and Validation of Results, Total Benefits, and Limitations

50.5. Future Directions

Index

Wen-mei W. Hwu