High Performance Parallelism Pearls Volume One - 1st Edition - ISBN: 9780128021187, 9780128021996

High Performance Parallelism Pearls Volume One

1st Edition

Multicore and Many-core Programming Approaches

Authors: James Reinders, James Jeffers
eBook ISBN: 9780128021996
Paperback ISBN: 9780128021187
Imprint: Morgan Kaufmann
Published Date: 3rd November 2014
Page Count: 600
File Compatibility per Device

PDF, EPUB, VSB (Vital Source):
PC, Apple Mac, iPhone, iPad, Android mobile devices.

Amazon Kindle eReader.



High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming, illustrating the most effective ways to tap the computational potential of systems with Intel Xeon Phi coprocessors and Intel Xeon processors or other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as chemistry, engineering, and environmental science. Each chapter in this edited work explains in detail the programming techniques used and shows high-performance results on both Intel Xeon Phi coprocessors and multicore processors. Learn from dozens of new examples and case studies: "success stories" that demonstrate not just the features of these powerful systems, but also how to leverage parallelism across these heterogeneous systems.

Key Features

  • Promotes consistent standards-based programming, showing in detail how to code for high performance on multicore processors and Intel® Xeon Phi™
  • Examples from multiple vertical domains illustrating parallel optimizations to modernize real-world codes
  • Source code available for download to facilitate further exploration



Readership

Software engineers in high-performance computing and system developers in vertical domains hoping to leverage HPC

Table of Contents

  • Acknowledgments
  • Foreword
    • Humongous computing needs: Science years in the making
    • Open standards
    • Keen on many-core architecture
    • Xeon Phi is born: Many cores, excellent vector ISA
    • Learn highly scalable parallel programming
    • Future demands grow: Programming models matter
  • Preface
    • Inspired by 61 cores: A new era in programming
  • Chapter 1: Introduction
    • Abstract
    • Learning from successful experiences
    • Code modernization
    • Modernize with concurrent algorithms
    • Modernize with vectorization and data locality
    • Understanding power usage
    • ISPC and OpenCL anyone?
    • Intel Xeon Phi coprocessor specific
    • Many-core, neo-heterogeneous
    • No “Xeon Phi” in the title, neo-heterogeneous programming
    • The future of many-core
    • Downloads
  • Chapter 2: From “Correct” to “Correct & Efficient”: A Hydro2D Case Study with Godunov’s Scheme
    • Abstract
    • Scientific computing on contemporary computers
    • A numerical method for shock hydrodynamics
    • Features of modern architectures
    • Paths to performance
    • Summary
  • Chapter 3: Better Concurrency and SIMD on HBM
    • Abstract
    • The application: HIROMB-BOOS-Model
    • Key usage: DMI
    • HBM execution profile
    • Overview for the optimization of HBM
    • Data structures: Locality done right
    • Thread parallelism in HBM
    • Data parallelism: SIMD vectorization
    • Results
    • Profiling details
    • Scaling on processor vs. coprocessor
    • Contiguous attribute
    • Summary
  • Chapter 4: Optimizing for Reacting Navier-Stokes Equations
    • Abstract
    • Getting started
    • Version 1.0: Baseline
    • Version 2.0: ThreadBox
    • Version 3.0: Stack memory
    • Version 4.0: Blocking
    • Version 5.0: Vectorization
    • Intel Xeon Phi coprocessor results
    • Summary
  • Chapter 5: Plesiochronous Phasing Barriers
    • Abstract
    • What can be done to improve the code?
    • What more can be done to improve the code?
    • Hyper-Thread Phalanx
    • What is nonoptimal about this strategy?
    • Coding the Hyper-Thread Phalanx
    • Back to work
    • Data alignment
    • The plesiochronous phasing barrier
    • Let us do something to recover this wasted time
    • A few “left to the reader” possibilities
    • Xeon host performance improvements similar to Xeon Phi
    • Summary
  • Chapter 6: Parallel Evaluation of Fault Tree Expressions
    • Abstract
    • Motivation and background
    • Example implementation
    • Other considerations
    • Summary
  • Chapter 7: Deep-Learning Numerical Optimization
    • Abstract
    • Fitting an objective function
    • Objective functions and principal components analysis
    • Software and example data
    • Training data
    • Runtime results
    • Scaling results
    • Summary
  • Chapter 8: Optimizing Gather/Scatter Patterns
    • Abstract
    • Gather/scatter instructions in Intel® architecture
    • Gather/scatter patterns in molecular dynamics
    • Optimizing gather/scatter patterns
    • Summary
  • Chapter 9: A Many-Core Implementation of the Direct N-Body Problem
    • Abstract
    • N-Body simulations
    • Initial solution
    • Theoretical limit
    • Reduce the overheads, align your data
    • Optimize the memory hierarchy
    • Improving our tiling
    • What does all this mean to the host version?
    • Summary
  • Chapter 10: N-Body Methods
    • Abstract
    • Fast N-body methods and direct N-body kernels
    • Applications of N-body methods
    • Direct N-body code
    • Performance results
    • Summary
  • Chapter 11: Dynamic Load Balancing Using OpenMP 4.0
    • Abstract
    • Maximizing hardware usage
    • The N-Body kernel
    • The offloaded version
    • A first processor combined with coprocessor version
    • Version for processor with multiple coprocessors
  • Chapter 12: Concurrent Kernel Offloading
    • Abstract
    • Setting the context
    • Concurrent kernels on the coprocessor
    • Force computation in PD using concurrent kernel offloading
    • The bottom line
  • Chapter 13: Heterogeneous Computing with MPI
    • Abstract
    • Acknowledgments
    • MPI in the modern clusters
    • MPI task location
    • Selection of the DAPL providers
    • Summary
  • Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor
    • Abstract
    • Power analysis 101
    • Measuring power and temperature with software
    • Hardware-based power analysis methods
    • Summary
  • Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment
    • Abstract
    • Acknowledgments
    • Early explorations
    • Beacon system history
    • Beacon system architecture
    • Intel MPSS installation procedure
    • Setting up the resource and workload managers
    • Health checking and monitoring
    • Scripting common commands
    • User software environment
    • Future directions
    • Summary
  • Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors
    • Abstract
    • Network configuration concepts and goals
    • Coprocessor file systems support
    • Summary
  • Chapter 17: NWChem: Quantum Chemistry Simulations at Scale
    • Abstract
    • Introduction
    • Overview of single-reference CC formalism
    • NWChem software architecture
    • Engineering an offload solution
    • Offload architecture
    • Kernel optimizations
    • Performance evaluation
    • Summary
    • Acknowledgments
  • Chapter 18: Efficient Nested Parallelism on Large-Scale Systems
    • Abstract
    • Motivation
    • The benchmark
    • Baseline benchmarking
    • Pipeline approach—flat_arena class
    • Intel® TBB user-managed task arenas
    • Hierarchical approach—hierarchical_arena class
    • Performance evaluation
    • Implication on NUMA architectures
    • Summary
  • Chapter 19: Performance Optimization of Black-Scholes Pricing
    • Abstract
    • Financial market model basics and the Black-Scholes formula
    • Case study
    • Summary
  • Chapter 20: Data Transfer Using the Intel COI Library
    • Abstract
    • First steps with the Intel COI library
    • COI buffer types and transfer performance
    • Applications
    • Summary
  • Chapter 21: High-Performance Ray Tracing
    • Abstract
    • Background
    • Vectorizing ray traversal
    • The Embree ray tracing kernels
    • Using Embree in an application
    • Performance
    • Summary
  • Chapter 22: Portable Performance with OpenCL
    • Abstract
    • The dilemma
    • A brief introduction to OpenCL
    • A matrix multiply example in OpenCL
    • OpenCL and the Intel Xeon Phi Coprocessor
    • Matrix multiply performance results
    • Case study: Molecular docking
    • Results: Portable performance
    • Related work
    • Summary
  • Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations
    • Abstract
    • Introduction
    • Performance evaluation
    • Standard optimizations
    • Summary
  • Chapter 24: Profiling-Guided Optimization
    • Abstract
    • Matrix transposition in computer science
    • Tools and methods
    • “Serial”: Our original in-place transposition
    • “Parallel”: Adding parallelism with OpenMP
    • “Tiled”: Improving data locality
    • “Regularized”: Microkernel with multiversioning
    • “Planned”: Exposing more parallelism
    • Summary
  • Chapter 25: Heterogeneous MPI Application Optimization with ITAC
    • Abstract
    • Asian options pricing
    • Application design
    • Synchronization in heterogeneous clusters
    • Finding bottlenecks with ITAC
    • Setting up ITAC
    • Unbalanced MPI run
    • Manual workload balance
    • Dynamic “Boss-Workers” load balancing
    • Conclusion
  • Chapter 26: Scalable Out-of-Core Solvers on a Cluster
    • Abstract
    • Introduction
    • An OOC factorization based on ScaLAPACK
    • Porting from NVIDIA GPU to the Intel Xeon Phi coprocessor
    • Numerical results
    • Conclusions and future work
    • Acknowledgments
  • Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization
    • Abstract
    • Acknowledgments
    • Background
    • Sparse matrix data structures
    • Parallel SpMV multiplication
    • Vectorization on the Intel Xeon Phi coprocessor
    • Evaluation
    • Summary
  • Chapter 28: Morton Order Improves Performance
    • Abstract
    • Improving cache locality by data ordering
    • Improving performance
    • Matrix transpose
    • Matrix multiply
    • Summary
  • Author Index
  • Subject Index


© Morgan Kaufmann 2015

About the Authors

James Reinders

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world’s first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for a number of Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. James has published numerous articles, contributed to several books and is widely interviewed on parallelism. James has managed software development groups, customer service and consulting teams, business development and marketing teams. James is sought after to keynote on parallel programming, and is the author/co-author of three books currently in print including Structured Parallel Programming, published by Morgan Kaufmann in 2012.

Affiliations and Expertise

Director and Programming Model Architect, Intel Corporation

James Jeffers

Jim Jeffers was the primary strategic planner and one of the first full-time employees on the program that became Intel® MIC. He served as lead SW Engineering Manager on the program and formed and launched the SW development team. As the program evolved, he became the workloads (applications) and SW performance team manager. He has some of the deepest insight into the market, architecture and programming usages of the MIC product line. He has been a developer and development manager for embedded and high performance systems for close to 30 years.

Affiliations and Expertise

Principal Engineer and Visualization Lead, Intel Corporation


"This book will make it much easier in general to exploit high levels of parallelism including programming optimally for the Intel Xeon Phi products. The common programming methodology between the Xeon and Xeon Phi families is good news for the entire scientific and engineering community; the same programming can realize parallel scaling and vectorization for both multicore and many-core." – from the Foreword by Sverre Jarp, CERN Openlab CTO