
High Performance Parallelism Pearls Volume One
Multicore and Many-core Programming Approaches
Description
Key Features
- Promotes consistent standards-based programming, showing in detail how to code for high performance on multicore processors and Intel® Xeon Phi™
- Examples from multiple vertical domains illustrating parallel optimizations to modernize real-world codes
- Source code available for download to facilitate further exploration
Readership
Software engineers in high-performance computing and system developers in vertical domains hoping to leverage HPC
Table of Contents
Foreword
- Humongous computing needs: Science years in the making
- Open standards
- Keen on many-core architecture
- Xeon Phi is born: Many cores, excellent vector ISA
- Learn highly scalable parallel programming
- Future demands grow: Programming models matter
Preface
- Inspired by 61 cores: A new era in programming
Chapter 1: Introduction
- Abstract
- Learning from successful experiences
- Code modernization
- Modernize with concurrent algorithms
- Modernize with vectorization and data locality
- Understanding power usage
- ISPC and OpenCL anyone?
- Intel Xeon Phi coprocessor specific
- Many-core, neo-heterogeneous
- No “Xeon Phi” in the title, neo-heterogeneous programming
- The future of many-core
- Downloads
Chapter 2: From “Correct” to “Correct & Efficient”: A Hydro2D Case Study with Godunov’s Scheme
- Abstract
- Scientific computing on contemporary computers
- A numerical method for shock hydrodynamics
- Features of modern architectures
- Paths to performance
- Summary
Chapter 3: Better Concurrency and SIMD on HBM
- Abstract
- The application: HIROMB-BOOS-Model
- Key usage: DMI
- HBM execution profile
- Overview for the optimization of HBM
- Data structures: Locality done right
- Thread parallelism in HBM
- Data parallelism: SIMD vectorization
- Results
- Profiling details
- Scaling on processor vs. coprocessor
- Contiguous attribute
- Summary
Chapter 4: Optimizing for Reacting Navier-Stokes Equations
- Abstract
- Getting started
- Version 1.0: Baseline
- Version 2.0: ThreadBox
- Version 3.0: Stack memory
- Version 4.0: Blocking
- Version 5.0: Vectorization
- Intel Xeon Phi coprocessor results
- Summary
Chapter 5: Plesiochronous Phasing Barriers
- Abstract
- What can be done to improve the code?
- What more can be done to improve the code?
- Hyper-Thread Phalanx
- What is nonoptimal about this strategy?
- Coding the Hyper-Thread Phalanx
- Back to work
- Data alignment
- The plesiochronous phasing barrier
- Let us do something to recover this wasted time
- A few “left to the reader” possibilities
- Xeon host performance improvements similar to Xeon Phi
- Summary
Chapter 6: Parallel Evaluation of Fault Tree Expressions
- Abstract
- Motivation and background
- Example implementation
- Other considerations
- Summary
Chapter 7: Deep-Learning Numerical Optimization
- Abstract
- Fitting an objective function
- Objective functions and principal components analysis
- Software and example data
- Training data
- Runtime results
- Scaling results
- Summary
Chapter 8: Optimizing Gather/Scatter Patterns
- Abstract
- Gather/scatter instructions in Intel® architecture
- Gather/scatter patterns in molecular dynamics
- Optimizing gather/scatter patterns
- Summary
Chapter 9: A Many-Core Implementation of the Direct N-Body Problem
- Abstract
- N-Body simulations
- Initial solution
- Theoretical limit
- Reduce the overheads, align your data
- Optimize the memory hierarchy
- Improving our tiling
- What does all this mean to the host version?
- Summary
Chapter 10: N-Body Methods
- Abstract
- Fast N-body methods and direct N-body kernels
- Applications of N-body methods
- Direct N-body code
- Performance results
- Summary
Chapter 11: Dynamic Load Balancing Using OpenMP 4.0
- Abstract
- Maximizing hardware usage
- The N-Body kernel
- The offloaded version
- A first processor combined with coprocessor version
- Version for processor with multiple coprocessors
Chapter 12: Concurrent Kernel Offloading
- Abstract
- Setting the context
- Concurrent kernels on the coprocessor
- Force computation in PD using concurrent kernel offloading
- The bottom line
Chapter 13: Heterogeneous Computing with MPI
- Abstract
- Acknowledgments
- MPI in modern clusters
- MPI task location
- Selection of the DAPL providers
- Summary
Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor
- Abstract
- Power analysis 101
- Measuring power and temperature with software
- Hardware-based power analysis methods
- Summary
Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment
- Abstract
- Acknowledgments
- Early explorations
- Beacon system history
- Beacon system architecture
- Intel MPSS installation procedure
- Setting up the resource and workload managers
- Health checking and monitoring
- Scripting common commands
- User software environment
- Future directions
- Summary
Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors
- Abstract
- Network configuration concepts and goals
- Coprocessor file systems support
- Summary
Chapter 17: NWChem: Quantum Chemistry Simulations at Scale
- Abstract
- Introduction
- Overview of single-reference CC formalism
- NWChem software architecture
- Engineering an offload solution
- Offload architecture
- Kernel optimizations
- Performance evaluation
- Summary
- Acknowledgments
Chapter 18: Efficient Nested Parallelism on Large-Scale Systems
- Abstract
- Motivation
- The benchmark
- Baseline benchmarking
- Pipeline approach—flat_arena class
- Intel® TBB user-managed task arenas
- Hierarchical approach—hierarchical_arena class
- Performance evaluation
- Implications for NUMA architectures
- Summary
Chapter 19: Performance Optimization of Black-Scholes Pricing
- Abstract
- Financial market model basics and the Black-Scholes formula
- Case study
- Summary
Chapter 20: Data Transfer Using the Intel COI Library
- Abstract
- First steps with the Intel COI library
- COI buffer types and transfer performance
- Applications
- Summary
Chapter 21: High-Performance Ray Tracing
- Abstract
- Background
- Vectorizing ray traversal
- The Embree ray tracing kernels
- Using Embree in an application
- Performance
- Summary
Chapter 22: Portable Performance with OpenCL
- Abstract
- The dilemma
- A brief introduction to OpenCL
- A matrix multiply example in OpenCL
- OpenCL and the Intel Xeon Phi Coprocessor
- Matrix multiply performance results
- Case study: Molecular docking
- Results: Portable performance
- Related work
- Summary
Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations
- Abstract
- Introduction
- Performance evaluation
- Standard optimizations
- Summary
Chapter 24: Profiling-Guided Optimization
- Abstract
- Matrix transposition in computer science
- Tools and methods
- “Serial”: Our original in-place transposition
- “Parallel”: Adding parallelism with OpenMP
- “Tiled”: Improving data locality
- “Regularized”: Microkernel with multiversioning
- “Planned”: Exposing more parallelism
- Summary
Chapter 25: Heterogeneous MPI Application Optimization with ITAC
- Abstract
- Asian options pricing
- Application design
- Synchronization in heterogeneous clusters
- Finding bottlenecks with ITAC
- Setting up ITAC
- Unbalanced MPI run
- Manual workload balance
- Dynamic “Boss-Workers” load balancing
- Conclusion
Chapter 26: Scalable Out-of-Core Solvers on a Cluster
- Abstract
- Introduction
- An OOC factorization based on ScaLAPACK
- Porting from NVIDIA GPU to the Intel Xeon Phi coprocessor
- Numerical results
- Conclusions and future work
- Acknowledgments
Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization
- Abstract
- Acknowledgments
- Background
- Sparse matrix data structures
- Parallel SpMV multiplication
- Vectorization on the Intel Xeon Phi coprocessor
- Evaluation
- Summary
Chapter 28: Morton Order Improves Performance
- Abstract
- Improving cache locality by data ordering
- Improving performance
- Matrix transpose
- Matrix multiply
- Summary
Product details
- No. of pages: 600
- Language: English
- Copyright: © Morgan Kaufmann 2014
- Published: November 3, 2014
- Imprint: Morgan Kaufmann
- eBook ISBN: 9780128021996
- Paperback ISBN: 9780128021187
About the Authors
James Reinders
James Jeffers