Programming Massively Parallel Processors

3rd Edition

A Hands-on Approach

Authors: David Kirk, Wen-mei Hwu
eBook ISBN: 9780128119877
Paperback ISBN: 9780128119860
Imprint: Morgan Kaufmann
Published Date: 7th December 2016
Page Count: 576

Description

Programming Massively Parallel Processors: A Hands-on Approach, Third Edition shows both students and professionals the basic concepts of parallel programming and GPU architecture, exploring in detail various techniques for constructing parallel programs.

Case studies demonstrate the development process, detailing computational thinking and ending with effective and efficient parallel programs. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth.

For this new edition, the authors have updated their coverage of CUDA, including coverage of newer libraries such as cuDNN, moved content that has become less important to appendices, added two new chapters on parallel patterns, and updated case studies to reflect current industry practices.
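
As a taste of the book's hands-on style, below is a minimal sketch of the kind of CUDA C vector-addition kernel developed in Chapter 2 (Section 2.3, "A Vector Addition Kernel"). It is an illustrative example written under common CUDA conventions, not a listing from the book:

// Illustrative sketch only; not a listing from the book.
// Each thread computes one element of C = A + B.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) C[i] = A[i] + B[i];                  // guard the final partial block
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host arrays.
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Allocate device arrays and copy the inputs to the GPU.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);  // expect 3.000000

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}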

Key Features

  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
  • Utilizes CUDA version 7.5, NVIDIA's software development tool created specifically for massively parallel environments
  • Contains new and updated case studies
  • Includes coverage of newer libraries, such as cuDNN for deep learning

Readership

Advanced students, software engineers, programmers, hardware engineers

Table of Contents

  • Dedication
  • Preface
    • Target Audience
    • How to Use the Book
    • Illinois–NVIDIA GPU Teaching Kit
    • Online Supplements
  • Acknowledgements
  • Chapter 1. Introduction
    • Abstract
    • 1.1 Heterogeneous Parallel Computing
    • 1.2 Architecture of a Modern GPU
    • 1.3 Why More Speed or Parallelism?
    • 1.4 Speeding Up Real Applications
    • 1.5 Challenges in Parallel Programming
    • 1.6 Parallel Programming Languages and Models
    • 1.7 Overarching Goals
    • 1.8 Organization of the Book
    • References
  • Chapter 2. Data parallel computing
    • Abstract
    • 2.1 Data Parallelism
    • 2.2 CUDA C Program Structure
    • 2.3 A Vector Addition Kernel
    • 2.4 Device Global Memory and Data Transfer
    • 2.5 Kernel Functions and Threading
    • 2.6 Kernel Launch
    • 2.7 Summary
    • References
  • Chapter 3. Scalable parallel execution
    • Abstract
    • 3.1 CUDA Thread Organization
    • 3.2 Mapping Threads to Multidimensional Data
    • 3.3 Image Blur: A More Complex Kernel
    • 3.4 Synchronization and Transparent Scalability
    • 3.5 Resource Assignment
    • 3.6 Querying Device Properties
    • 3.7 Thread Scheduling and Latency Tolerance
    • 3.8 Summary
  • Chapter 4. Memory and data locality
    • Abstract
    • 4.1 Importance of Memory Access Efficiency
    • 4.2 Matrix Multiplication
    • 4.3 CUDA Memory Types
    • 4.4 Tiling for Reduced Memory Traffic
    • 4.5 A Tiled Matrix Multiplication Kernel
    • 4.6 Boundary Checks
    • 4.7 Memory as a Limiting Factor to Parallelism
    • 4.8 Summary
  • Chapter 5. Performance considerations
    • Abstract
    • 5.1 Global Memory Bandwidth
    • 5.2 More on Memory Parallelism
    • 5.3 Warps and SIMD Hardware
    • 5.4 Dynamic Partitioning of Resources
    • 5.5 Thread Granularity
    • 5.6 Summary
    • References
  • Chapter 6. Numerical considerations
    • Abstract
    • 6.1 Floating-Point Data Representation
    • 6.2 Representable Numbers
    • 6.3 Special Bit Patterns and Precision in IEEE Format
    • 6.4 Arithmetic Accuracy and Rounding
    • 6.5 Algorithm Considerations
    • 6.6 Linear Solvers and Numerical Stability
    • 6.7 Summary
    • References
  • Chapter 7. Parallel patterns: convolution: An introduction to stencil computation
    • Abstract
    • 7.1 Background
    • 7.2 1D Parallel Convolution—A Basic Algorithm
    • 7.3 Constant Memory and Caching
    • 7.4 Tiled 1D Convolution with Halo Cells
    • 7.5 A Simpler Tiled 1D Convolution—General Caching
    • 7.6 Tiled 2D Convolution With Halo Cells
    • 7.7 Summary
    • 7.8 Exercises
  • Chapter 8. Parallel patterns: prefix sum: An introduction to work efficiency in parallel algorithms
    • Abstract
    • 8.1 Background
    • 8.2 A Simple Parallel Scan
    • 8.3 Speed and Work Efficiency
    • 8.4 A More Work-Efficient Parallel Scan
    • 8.5 An Even More Work-Efficient Parallel Scan
    • 8.6 Hierarchical Parallel Scan for Arbitrary-Length Inputs
    • 8.7 Single-Pass Scan for Memory Access Efficiency
    • 8.8 Summary
    • 8.9 Exercises
    • References
  • Chapter 9. Parallel patterns—parallel histogram computation: An introduction to atomic operations and privatization
    • Abstract
    • 9.1 Background
    • 9.2 Use of Atomic Operations
    • 9.3 Block versus Interleaved Partitioning
    • 9.4 Latency versus Throughput of Atomic Operations
    • 9.5 Atomic Operation in Cache Memory
    • 9.6 Privatization
    • 9.7 Aggregation
    • 9.8 Summary
    • Reference
  • Chapter 10. Parallel patterns: sparse matrix computation: An introduction to data compression and regularization
    • Abstract
    • 10.1 Background
    • 10.2 Parallel SpMV Using CSR
    • 10.3 Padding and Transposition
    • 10.4 Using a Hybrid Approach to Regulate Padding
    • 10.5 Sorting and Partitioning for Regularization
    • 10.6 Summary
    • References
  • Chapter 11. Parallel patterns: merge sort: An introduction to tiling with dynamic input data identification
    • Abstract
    • 11.1 Background
    • 11.2 A Sequential Merge Algorithm
    • 11.3 A Parallelization Approach
    • 11.4 Co-Rank Function Implementation
    • 11.5 A Basic Parallel Merge Kernel
    • 11.6 A Tiled Merge Kernel
    • 11.7 A Circular-Buffer Merge Kernel
    • 11.8 Summary
    • Reference
  • Chapter 12. Parallel patterns: graph search
    • Abstract
    • 12.1 Background
    • 12.2 Breadth-First Search
    • 12.3 A Sequential BFS Function
    • 12.4 A Parallel BFS Function
    • 12.5 Optimizations
    • 12.6 Summary
    • References
  • Chapter 13. CUDA dynamic parallelism
    • Abstract
    • 13.1 Background
    • 13.2 Dynamic Parallelism Overview
    • 13.3 A Simple Example
    • 13.4 Memory Data Visibility
    • 13.5 Configurations and Memory Management
    • 13.6 Synchronization, Streams, and Events
    • 13.7 A More Complex Example
    • 13.8 A Recursive Example
    • 13.9 Summary
    • References
    • A13.1 Code Appendix
  • Chapter 14. Application case study—non-Cartesian magnetic resonance imaging: An introduction to statistical estimation methods
    • Abstract
    • 14.1 Background
    • 14.2 Iterative Reconstruction
    • 14.3 Computing FHD
    • 14.4 Final Evaluation
    • References
  • Chapter 15. Application case study—molecular visualization and analysis
    • Abstract
    • 15.1 Background
    • 15.2 A Simple Kernel Implementation
    • 15.3 Thread Granularity Adjustment
    • 15.4 Memory Coalescing
    • 15.5 Summary
    • References
  • Chapter 16. Application case study—machine learning
    • Abstract
    • 16.1 Background
    • 16.2 Convolutional Neural Networks
    • 16.3 Convolutional Layer: A Basic CUDA Implementation of Forward Propagation
    • 16.4 Reduction of Convolutional Layer to Matrix Multiplication
    • 16.5 cuDNN Library
    • References
  • Chapter 17. Parallel programming and computational thinking
    • Abstract
    • 17.1 Goals of Parallel Computing
    • 17.2 Problem Decomposition
    • 17.3 Algorithm Selection
    • 17.4 Computational Thinking
    • 17.5 Single Program, Multiple Data, Shared Memory and Locality
    • 17.6 Strategies for Computational Thinking
    • 17.7 A Hypothetical Example: Sodium Map of the Brain
    • 17.8 Summary
    • References
  • Chapter 18. Programming a heterogeneous computing cluster
    • Abstract
    • 18.1 Background
    • 18.2 A Running Example
    • 18.3 Message Passing Interface Basics
    • 18.4 Message Passing Interface Point-to-Point Communication
    • 18.5 Overlapping Computation and Communication
    • 18.6 Message Passing Interface Collective Communication
    • 18.7 CUDA-Aware Message Passing Interface
    • 18.8 Summary
    • Reference
  • Chapter 19. Parallel programming with OpenACC
    • Abstract
    • 19.1 The OpenACC Execution Model
    • 19.2 OpenACC Directive Format
    • 19.3 OpenACC by Example
    • 19.4 Comparing OpenACC and CUDA
    • 19.5 Interoperability with CUDA and Libraries
    • 19.6 The Future of OpenACC
  • Chapter 20. More on CUDA and graphics processing unit computing
    • Abstract
    • 20.1 Model of Host/Device Interaction
    • 20.2 Kernel Execution Control
    • 20.3 Memory Bandwidth and Compute Throughput
    • 20.4 Programming Environment
    • 20.5 Future Outlook
    • References
  • Chapter 21. Conclusion and outlook
    • Abstract
    • 21.1 Goals Revisited
    • 21.2 Future Outlook
  • Appendix A. An introduction to OpenCL
    • A.1 Background
    • A.2 Data Parallelism Model
    • A.3 Device Architecture
    • A.4 Kernel Functions
    • A.5 Device Management and Kernel Launch
    • A.6 Electrostatic Potential Map in OpenCL
    • A.7 Summary
  • Appendix B. Thrust: a productivity-oriented library for CUDA
    • B.1 Background
    • B.2 Motivation
    • B.3 Basic Thrust Features
    • B.4 Generic Programming
    • B.5 Benefits of Abstraction
    • B.6 Best Practices
  • Appendix C. CUDA Fortran
    • C.1 CUDA Fortran and CUDA C Differences
    • C.2 A First CUDA Fortran Program
    • C.3 Multidimensional Array in CUDA Fortran
    • C.4 Overloading Host/Device Routines with Generic Interfaces
    • C.5 Calling CUDA C via ISO_C_Binding
    • C.6 Kernel Loop Directives and Reduction Operations
    • C.7 Dynamic Shared Memory
    • C.8 Asynchronous Data Transfers
    • C.9 Compilation and Profiling
    • C.10 Calling Thrust from CUDA Fortran
  • Appendix D. An introduction to C++ AMP
    • D.1 Core C++ AMP Features
    • D.2 Details of the C++ AMP Execution Model
    • D.3 Managing Accelerators
    • D.4 Tiled Execution
    • D.5 C++ AMP Graphics Features
    • D.6 Summary
    • Reference
  • Index

Details

No. of pages: 576
Language: English
Copyright: © Morgan Kaufmann 2017
Published: 7th December 2016
Imprint: Morgan Kaufmann
eBook ISBN: 9780128119877
Paperback ISBN: 9780128119860

About the Author

David Kirk

David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. Before beginning his studies at Caltech, he earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow.

At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery's Special Interest Group on Computer Graphics and Interactive Techniques (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.

Kirk holds 50 patents and patent applications relating to graphics design. He has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.

Affiliations and Expertise

NVIDIA Fellow

Wen-mei Hwu

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu), and a co-founder and CTO of MulticoreWare. For his contributions to research and teaching, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters petascale computing project. Dr. Hwu received his Ph.D. in Computer Science from the University of California, Berkeley.

Affiliations and Expertise

CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign