High Performance Parallelism Pearls Volume One

Multicore and Many-core Programming Approaches

1st Edition - November 3, 2014
Authors: James Reinders, James Jeffers
Language: English
Paperback ISBN: 978-0-12-802118-7
eBook ISBN: 978-0-12-802199-6

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming, illustrating the most effective ways to tap the computational potential of systems with Intel Xeon Phi coprocessors and Intel Xeon processors or other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as chemistry, engineering, and environmental science. Each chapter in this edited work explains the programming techniques used in detail and shows high performance results on both Intel Xeon Phi coprocessors and multicore processors. Learn from dozens of new examples and case studies, "success stories" that demonstrate not just the features of these powerful systems, but also how to leverage parallelism across these heterogeneous systems.

Foreword

Humongous computing needs: Science years in the making
Open standards
Keen on many-core architecture
Xeon Phi is born: Many cores, excellent vector ISA
Learn highly scalable parallel programming
Future demands grow: Programming models matter

Preface

Inspired by 61 cores: A new era in programming

Chapter 1: Introduction

Abstract
Learning from successful experiences
Code modernization
Modernize with concurrent algorithms
Modernize with vectorization and data locality
Understanding power usage
ISPC and OpenCL anyone?
Intel Xeon Phi coprocessor specific
Many-core, neo-heterogeneous
No “Xeon Phi” in the title, neo-heterogeneous programming
The future of many-core
Downloads

Chapter 2: From “Correct” to “Correct & Efficient”: A Hydro2D Case Study with Godunov’s Scheme

Abstract
Scientific computing on contemporary computers
A numerical method for shock hydrodynamics
Features of modern architectures
Paths to performance
Summary

Chapter 3: Better Concurrency and SIMD on HBM

Abstract
The application: HIROMB-BOOS-Model
Key usage: DMI
HBM execution profile
Overview for the optimization of HBM
Data structures: Locality done right
Thread parallelism in HBM
Data parallelism: SIMD vectorization
Results
Profiling details
Scaling on processor vs. coprocessor
Contiguous attribute
Summary

Chapter 4: Optimizing for Reacting Navier-Stokes Equations

Abstract
Getting started
Version 1.0: Baseline
Version 2.0: ThreadBox
Version 3.0: Stack memory
Version 4.0: Blocking
Version 5.0: Vectorization
Intel Xeon Phi coprocessor results
Summary

Chapter 5: Plesiochronous Phasing Barriers

Abstract
What can be done to improve the code?
What more can be done to improve the code?
Hyper-Thread Phalanx
What is nonoptimal about this strategy?
Coding the Hyper-Thread Phalanx
Back to work
Data alignment
The plesiochronous phasing barrier
Let us do something to recover this wasted time
A few “left to the reader” possibilities
Xeon host performance improvements similar to Xeon Phi
Summary

Chapter 6: Parallel Evaluation of Fault Tree Expressions

Abstract
Motivation and background
Example implementation
Other considerations
Summary

Chapter 7: Deep-Learning Numerical Optimization

Abstract
Fitting an objective function
Objective functions and principal components analysis
Software and example data
Training data
Runtime results
Scaling results
Summary

Chapter 8: Optimizing Gather/Scatter Patterns

Abstract
Gather/scatter instructions in Intel® architecture
Gather/scatter patterns in molecular dynamics
Optimizing gather/scatter patterns
Summary

Chapter 9: A Many-Core Implementation of the Direct N-Body Problem

Abstract
N-Body simulations
Initial solution
Theoretical limit
Reduce the overheads, align your data
Optimize the memory hierarchy
Improving our tiling
What does all this mean to the host version?
Summary

Chapter 10: N-Body Methods

Abstract
Fast N-body methods and direct N-body kernels
Applications of N-body methods
Direct N-body code
Performance results
Summary

Chapter 11: Dynamic Load Balancing Using OpenMP 4.0

Abstract
Maximizing hardware usage
The N-Body kernel
The offloaded version
A first processor combined with coprocessor version
Version for processor with multiple coprocessors

Chapter 12: Concurrent Kernel Offloading

Abstract
Setting the context
Concurrent kernels on the coprocessor
Force computation in PD using concurrent kernel offloading
The bottom line

Chapter 13: Heterogeneous Computing with MPI

Abstract
Acknowledgments
MPI in the modern clusters
MPI task location
Selection of the DAPL providers
Summary

Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor

Abstract
Power analysis 101
Measuring power and temperature with software
Hardware-based power analysis methods
Summary

Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment

Abstract
Acknowledgments
Early explorations
Beacon system history
Beacon system architecture
Intel MPSS installation procedure
Setting up the resource and workload managers
Health checking and monitoring
Scripting common commands
User software environment
Future directions
Summary

Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors

Abstract
Network configuration concepts and goals
Coprocessor file systems support
Summary

Chapter 17: NWChem: Quantum Chemistry Simulations at Scale

Abstract
Introduction
Overview of single-reference CC formalism
NWChem software architecture
Engineering an offload solution
Offload architecture
Kernel optimizations
Performance evaluation
Summary
Acknowledgments

Chapter 18: Efficient Nested Parallelism on Large-Scale Systems

Abstract
Motivation
The benchmark
Baseline benchmarking
Pipeline approach—flat_arena class
Intel® TBB user-managed task arenas
Hierarchical approach—hierarchical_arena class
Performance evaluation
Implication on NUMA architectures
Summary

Chapter 19: Performance Optimization of Black-Scholes Pricing

Abstract
Financial market model basics and the Black-Scholes formula
Case study
Summary

Chapter 20: Data Transfer Using the Intel COI Library

Abstract
First steps with the Intel COI library
COI buffer types and transfer performance
Applications
Summary

Chapter 21: High-Performance Ray Tracing

Abstract
Background
Vectorizing ray traversal
The Embree ray tracing kernels
Using Embree in an application
Performance
Summary

Chapter 22: Portable Performance with OpenCL

Abstract
The dilemma
A brief introduction to OpenCL
A matrix multiply example in OpenCL
OpenCL and the Intel Xeon Phi Coprocessor
Matrix multiply performance results
Case study: Molecular docking
Results: Portable performance
Related work
Summary

Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations

Abstract
Introduction
Performance evaluation
Standard optimizations
Summary

Chapter 24: Profiling-Guided Optimization

Abstract
Matrix transposition in computer science
Tools and methods
“Serial”: Our original in-place transposition
“Parallel”: Adding parallelism with OpenMP
“Tiled”: Improving data locality
“Regularized”: Microkernel with multiversioning
“Planned”: Exposing more parallelism
Summary

Chapter 25: Heterogeneous MPI Application Optimization with ITAC

Abstract
Asian options pricing
Application design
Synchronization in heterogeneous clusters
Finding bottlenecks with ITAC
Setting up ITAC
Unbalanced MPI run
Manual workload balance
Dynamic “Boss-Workers” load balancing
Conclusion

Chapter 26: Scalable Out-of-Core Solvers on a Cluster

Abstract
Introduction
An OOC factorization based on ScaLAPACK
Porting from NVIDIA GPU to the Intel Xeon Phi coprocessor
Numerical results
Conclusions and future work
Acknowledgments

Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization

Abstract
Acknowledgments
Background
Sparse matrix data structures
Parallel SpMV multiplication
Vectorization on the Intel Xeon Phi coprocessor
Evaluation
Summary

Chapter 28: Morton Order Improves Performance

Abstract
Improving cache locality by data ordering
Improving performance
Matrix transpose
Matrix multiply
Summary
