GPU Computing Gems Jade Edition

GPU Computing Gems, Jade Edition, offers hands-on, proven techniques for general-purpose GPU programming based on the successful application experiences of leading researchers and developers. One of the few resources available that distill the best practices of the CUDA programming community, this second edition contains 100% new material of interest across industry, including finance, medicine, imaging, engineering, gaming, environmental science, and green computing. It covers new tools and frameworks for productive GPU computing application development and provides immediate benefit to researchers developing improved programming environments for GPUs.

Divided into six sections, this book explains how to achieve effective GPU execution through algorithm implementation techniques and approaches to data structure layout. More specifically, it considers three general requirements: a high level of parallelism, coherent memory access by threads within warps, and coherent control flow within warps. Chapters explore topics such as accelerating database searches, leveraging the Fermi GPU architecture to further accelerate prefix operations, and implementing hash tables on the GPU. There are also discussions of the state of GPU computing in interactive physics and artificial intelligence; programming tools and techniques for GPU computing; and the edge and node parallelism approaches for computing graph centrality metrics. In addition, the book proposes an alternative approach that balances computation regardless of node degree variance.
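To make the second requirement (coherent memory access within a warp) concrete, here is a simplified Python model. It is not code from the book. It counts how many aligned memory segments a 32-thread warp touches when thread t loads word t × stride. The 128-byte segment size, 4-byte word size, and the function names `segments_touched` and `warp_indices` are assumptions chosen for illustration. Actual transaction sizes depend on the GPU generation and caching mode.

```python
"""Simplified model of how many memory segments one warp's load touches.

Illustrative assumptions only: 32 threads per warp, 4-byte words, and
128-byte aligned segments. Real GPUs differ by generation and cache mode.
"""

WARP_SIZE = 32       # threads per warp (assumed)
WORD_BYTES = 4       # size of each loaded element, e.g. a float (assumed)
SEGMENT_BYTES = 128  # aligned memory segment size (assumed)


def segments_touched(word_indices, segment_bytes=SEGMENT_BYTES):
    """Return the number of distinct aligned segments hit by a warp's loads."""
    return len({(i * WORD_BYTES) // segment_bytes for i in word_indices})


def warp_indices(stride, warp_size=WARP_SIZE):
    """Word index loaded by each thread t of one warp: t * stride."""
    return [t * stride for t in range(warp_size)]


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        n = segments_touched(warp_indices(stride))
        print(f"stride {stride:2d}: {n:2d} segment(s) per warp load")
```

Under these assumptions, strides of 1, 2, 8, and 32 touch 1, 2, 8, and 32 segments respectively. A unit-stride (coalesced) pattern is served by a single segment, while a 32-word stride forces one segment per thread. This is why the data-layout choices discussed in the book matter.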

Software engineers, programmers, hardware engineers, and advanced students will find this book extremely useful. For the source code discussed throughout the book, the editors invite readers to visit the following website: http://gpugems.hwu-server2.crhc.illinois.edu…

Editors, Reviewers, and Authors

Editor-In-Chief

Managing Editor

NVIDIA Editor

Area Editors

Reviewers

Authors

Introduction

State of GPU Computing

Section 1: Parallel Algorithms and Data Structures

Introduction

In this Section

Chapter 1. Large-Scale GPU Search

1.1 Introduction

1.2 Memory Performance

1.3 Searching Large Data Sets

1.4 Experimental Evaluation

1.5 Conclusion

References

Chapter 2. Edge v. Node Parallelism for Graph Centrality Metrics

2.1 Introduction

2.2 Background

2.3 Node v. Edge Parallelism

2.4 Data Structure

2.5 Implementation

2.6 Analysis

2.7 Results

2.8 Conclusions

References

Chapter 3. Optimizing Parallel Prefix Operations for the Fermi Architecture

3.1 Introduction to Parallel Prefix Operations

3.2 Efficient Binary Prefix Operations on Fermi

3.3 Conclusion

References

Chapter 4. Building an Efficient Hash Table on the GPU

4.1 Introduction

4.2 Overview

4.3 Building and Querying a Basic Hash Table

4.4 Specializing the Hash Table

4.5 Analysis

4.6 Conclusion

Acknowledgments

References

Chapter 5. Efficient CUDA Algorithms for the Maximum Network Flow Problem

5.1 Introduction, Problem Statement, and Context

5.2 Core Method

5.3 Algorithms, Implementations, and Evaluations

5.4 Final Evaluation

5.5 Future Directions

References

Chapter 6. Optimizing Memory Access Patterns for Cellular Automata on GPUs

6.1 Introduction, Problem Statement, and Context

6.2 Core Methods

6.3 Algorithms, Implementations, and Evaluations

6.4 Final Results

6.5 Future Directions

References

Chapter 7. Fast Minimum Spanning Tree Computation

7.1 Introduction, Problem Statement, and Context

7.2 The MST Algorithm: Overview

7.3 CUDA Implementation of MST

7.4 Evaluation

7.5 Conclusions

References

Chapter 8. Comparison-Based In-Place Sorting with CUDA

8.1 Introduction

8.2 Bitonic Sort

8.3 Implementation

8.4 Evaluation

8.5 Conclusion

References

Section 2: Numerical Algorithms

Introduction

State of GPU-Based Numerical Algorithms

In this Section

Chapter 9. Interval Arithmetic in CUDA

9.1 Interval Arithmetic

9.2 Importance of Rounding Modes

9.3 Interval Operators in CUDA

9.4 Some Evaluations: Synthetic Benchmark

9.5 Application-Level Benchmark

9.6 Conclusion

References

Chapter 10. Approximating the erfinv Function

10.1 Introduction

10.2 New erfinv Approximations

10.3 Performance and Accuracy

10.4 Conclusions

References

Chapter 11. A Hybrid Method for Solving Tridiagonal Systems on the GPU

11.1 Introduction

11.3 Algorithms

11.4 Implementation

11.5 Results and Evaluation

11.6 Future Directions

Source code

References

Chapter 12. Accelerating CULA Linear Algebra Routines with Hybrid GPU and Multicore Computing

12.1 Introduction, Problem Statement, and Context

12.2 Core Methods

12.3 Algorithms, Implementations, and Evaluations

12.4 Final Evaluation and Validation of Results, Total Benefits, and Limitations

12.5 Future Directions

References

Chapter 13. GPU Accelerated Derivative-Free Mesh Optimization

13.1 Introduction, Problem Statement, and Context

13.2 Core Method

13.3 Algorithms, Implementations, and Evaluations

13.4 Final Evaluation

13.5 Future Direction

References

Section 3: Engineering Simulation

Introduction

State of GPU Computing in Engineering Simulations

In this Section

Chapter 14. Large-Scale Gas Turbine Simulations on GPU Clusters

14.1 Introduction, Problem Statement, and Context

14.2 Core Method

14.3 Algorithms, Implementations, and Evaluations

14.4 Final Evaluation

14.5 Test Case and Parallel Performance

14.6 Future Directions

References

Chapter 15. GPU Acceleration of Rarefied Gas Dynamic Simulations

15.1 Introduction, Problem Statement, and Context

15.2 Core Methods

15.3 Algorithms, Implementations, and Evaluations

15.4 Final Evaluation

15.5 Future Directions

References

Chapter 16. Application of Assembly of Finite Element Methods on Graphics Processors for Real-Time Elastodynamics

16.1 Introduction, Problem Statement, and Context

16.2 Core Method

16.3 Algorithms, Implementations, and Evaluations

16.4 Evaluation and Validation of Results, Total Benefits, Limitations

16.5 Future Directions

Acknowledgments

References

Chapter 17. CUDA Implementation of Vertex-Centered, Finite Volume CFD Methods on Unstructured Grids with Flow Control Applications

17.1 Introduction, Problem Statement, and Context

17.2 Core (CFD and Optimization) Methods

17.3 Implementations and Evaluation

17.4 Applications to Flow Control — Optimization

References

Chapter 18. Solving Wave Equations on Unstructured Geometries

18.1 Introduction, Problem Statement, and Context

18.2 Core Method

18.3 Algorithms, Implementations, and Evaluations

18.4 Final Evaluation

18.5 Future Directions

Acknowledgments

References

Chapter 19. Fast Electromagnetic Integral Equation Solvers on Graphics Processing Units

19.1 Problem Statement and Background

19.2 Algorithms Introduction

19.3 Algorithm Description

19.4 GPU Implementations

19.5 Results

19.6 Integrating the GPU NGIM Algorithms with Iterative IE Solvers

19.7 Future Directions

References

Section 4: Interactive Physics and AI for Games and Engineering Simulation

Introduction

State of GPU Computing in Interactive Physics and AI

In this Section

Chapter 20. Solving Large Multibody Dynamics Problems on the GPU

20.1 Introduction, Problem Statement, and Context

20.2 Core Method

20.3 The Time-Stepping Scheme

20.4 Algorithms, Implementations, and Evaluations

20.5 Final Evaluation

20.6 Future Directions

Acknowledgments

References

Chapter 21. Implicit FEM Solver on GPU for Interactive Deformation Simulation

21.1 Problem Statement and Context

21.2 Core Method

21.3 Algorithms and Implementations

21.4 Results and Evaluation

21.5 Future Directions

Acknowledgements

References

Chapter 22. Real-Time Adaptive GPU Multiagent Path Planning

22.1 Introduction

22.2 Core Method

22.3 Implementation

22.4 Results

References

Section 5: Computational Finance

Introduction

State of GPU Computing in Computational Finance

In this Section

Chapter 23. Pricing Financial Derivatives with High Performance Finite Difference Solvers on GPUs

23.1 Introduction, Problem Statement, and Context

23.2 Core Method

23.3 Algorithms, Implementations, and Evaluations

23.4 Final Evaluation

23.5 Future Directions

References

Chapter 24. Large-Scale Credit Risk Loss Simulation

24.1 Introduction, Problem Statement, and Context

24.2 Core Methods

24.3 Algorithms, Implementations, Evaluations

24.4 Results and Conclusions

24.5 Future Developments

Acknowledgements

References

Chapter 25. Monte Carlo–Based Financial Market Value-at-Risk Estimation on GPUs

25.1 Introduction, Problem Statement, and Context

25.2 Core Methods

25.3 Algorithms, Implementations, and Evaluations

25.4 Final Results

25.5 Conclusion

References

Section 6: Programming Tools and Techniques

Introduction

Programming Tools and Techniques for GPU Computing

In this Section

Chapter 26. Thrust: A Productivity-Oriented Library for CUDA

26.1 Motivation

26.2 Diving In

26.3 Generic Programming

26.4 Benefits of Abstraction

26.5 Best Practices

References

Chapter 27. GPU Scripting and Code Generation with PyCUDA

27.1 Introduction, Problem Statement, and Context

27.2 Core Method

27.3 Algorithms, Implementations, and Evaluations

27.4 Evaluation

27.5 Availability

27.6 Future Directions

Acknowledgment

References

Chapter 28. Jacket: GPU Powered MATLAB Acceleration

28.1 Introduction

28.2 Jacket

28.3 Benchmarking Procedures

28.4 Experimental Results

28.5 Future Directions

References

Chapter 29. Accelerating Development and Execution Speed with Just-in-Time GPU Code Generation

29.1 Introduction, Problem Statement, and Context

29.2 Core Methods

29.3 Algorithms, Implementations, and Evaluations

29.4 Final Evaluation

29.5 Future Directions

References

Chapter 30. GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot

30.1 Introduction

30.2 Core Technology

30.3 Algorithm, Implementation, and Benefits

30.4 Future Directions

Acknowledgements

References

Chapter 31. Abstraction for AoS and SoA Layout in C++

31.1 Introduction, Problem Statement, and Context

31.2 Core Method

31.3 Implementation

31.4 ASA in Practice

31.5 Final Evaluation

Acknowledgments

References

Chapter 32. Processing Device Arrays with C++ Metaprogramming

32.1 Introduction, Problem Statement, and Context

32.2 Core Method

32.3 Implementation

32.4 Evaluation

32.5 Future Directions

References

Chapter 33. GPU Metaprogramming: A Case Study in Biologically Inspired Machine Vision

33.1 Introduction, Problem Statement, and Context

33.2 Core Method

33.3 Algorithms, Implementations, and Evaluations

33.4 Final Evaluation

33.5 Future Directions

References

Chapter 34. A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs

34.1 Introduction, Problem Statement, and Context

34.2 Core Method

34.3 Algorithms, Implementations, and Evaluations

34.4 Final Evaluation

34.5 Future Directions

References

Chapter 35. Dynamic Load Balancing Using Work-Stealing

35.1 Introduction

35.2 Core Method

35.3 Algorithms and Implementations

35.4 Case Studies and Evaluation

35.5 Future Directions

Acknowledgments

References

Chapter 36. Applying Software-Managed Caching and CPU/GPU Task Scheduling for Accelerating Dynamic Workloads

36.1 Introduction, Problem Statement, and Context

36.2 Core Method

36.3 Algorithms, Implementations, and Evaluations

36.4 Final Evaluation

References

Index

Wen-mei W. Hwu